Project 7: Implement a scoring model

Part 1: Preprocessing and analysis

For this project, I am a data scientist at a company that offers credit. The company wants to develop a scoring model for the probability of client default and pair it with an interactive dashboard, so that customer relationship managers can explain decisions to grant or refuse a loan as transparently as possible.

In this notebook, I preprocess the data and explore it.

Setting up the work environment

Libraries

In [ ]:
# General
# File system management
import os
import glob

# Visualisation
import matplotlib.pyplot as plt
from matplotlib import cm
import numpy as np
import pandas as pd
import seaborn as sns
import missingno as msno

import math
import scipy
import scipy.stats as stats
from scipy.stats import variation

import collections
from collections import Counter

from termcolor import colored

from sklearn.impute import KNNImputer
from sklearn.preprocessing import LabelEncoder
from sklearn.manifold import TSNE
from sklearn.decomposition import PCA

%matplotlib inline

Parameters

In [ ]:
# Format & option
sns.set(rc={"figure.figsize": (16, 9)})
pd.options.display.max_columns = 150

# Style use
sns.set_style("darkgrid")
plt.style.use("ggplot")
sns.set_context("notebook", font_scale=1.5, rc={"lines.linewidth": 2.5})
%matplotlib inline

# Suppress warnings
import warnings

warnings.filterwarnings("ignore")

Importing data

The data is provided by Home Credit, a service dedicated to providing lines of credit (loans) to the unbanked population. Predicting whether a client will repay a loan or run into difficulties is a critical business need.

In [ ]:
# Google Colab variant (disabled): mount Google Drive and list the data folder.
# from google.colab import drive
# drive.mount('/content/drive')
#
# %cd /content/drive/MyDrive/Data_projet_OC
#
# print(os.listdir(r"/content/drive/MyDrive/Data_projet_OC/"))
In [ ]:
#List of files.
print(
    os.listdir(
        r"/Users/amandinelecerfdefer/Desktop/Formation_Data_Scientist_OC/WORK-projet7/Data"
    )
)
['application_test.csv', '.DS_Store', 'HomeCredit_columns_description.csv', 'POS_CASH_balance.csv', 'credit_card_balance.csv', 'installments_payments.csv', 'application_train.csv', 'bureau.csv', 'previous_application.csv', 'bureau_balance.csv', 'sample_submission.csv']

There are 10 CSV files in total: 1 main training file (with the target), 1 main test file (without the target), 1 column-description file, 1 sample submission file, and 6 other files containing additional information about each loan.

application_{train|test}.csv: the main training and test data, with information about each loan application at Home Credit. Each loan has its own row and is identified by the SK_ID_CURR feature. The training application data comes with the TARGET feature, where 0 means the loan was repaid and 1 means the loan was not repaid.

bureau.csv: data on the client's previous credits from other financial institutions. Each previous credit has its own row in bureau, but one loan in the application data can have several previous credits.

bureau_balance.csv: monthly data on the previous credits in bureau. Each row is one month of a previous credit, and a single previous credit can have several rows, one for each month of the credit's duration.

POS_CASH_balance.csv: monthly data on the point-of-sale or cash loans that clients have had with Home Credit. Each row is one month of a previous point-of-sale or cash loan, and a single previous loan can have several rows.

credit_card_balance.csv: monthly data on the credit cards that clients have had with Home Credit. Each row is one month of a credit card balance, and a single credit card can have several rows.

previous_application.csv: previous applications for loans at Home Credit by clients who have loans in the application data. Each current loan in the application data can have several previous loans. Each previous application has one row and is identified by the SK_ID_PREV feature.

installments_payments.csv: payment history for previous loans at Home Credit. There is one row for every payment made and one row for every missed payment.

HomeCredit_columns_description.csv: this file contains the descriptions of the columns in the various data files.

Existing relationships between the different files

[Figure: diagram of the relationships between the data files, https://storage.googleapis.com/kaggle-media/competitions/home-credit/home_credit.png]
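The one-to-many links described above (one loan in the application data, several rows in bureau) can be sketched on toy data. This is only an illustration under stated assumptions: the values are made up, and the feature name BUREAU_CREDIT_COUNT is hypothetical, not something defined in this notebook.

```python
import pandas as pd

# Toy stand-ins for application_train.csv and bureau.csv (values are made up).
app = pd.DataFrame({"SK_ID_CURR": [100002, 100003, 100004]})
bureau_toy = pd.DataFrame({
    "SK_ID_BUREAU": [1, 2, 3, 4],
    "SK_ID_CURR": [100002, 100002, 100002, 100003],
})

# Collapse the many bureau rows into one row per current loan.
bureau_counts = (
    bureau_toy.groupby("SK_ID_CURR")["SK_ID_BUREAU"]
    .count()
    .rename("BUREAU_CREDIT_COUNT")  # hypothetical feature name
    .reset_index()
)

# Left join so that loans without any bureau history are kept.
merged = app.merge(bureau_counts, on="SK_ID_CURR", how="left")
merged["BUREAU_CREDIT_COUNT"] = merged["BUREAU_CREDIT_COUNT"].fillna(0).astype(int)
print(merged)
```

Aggregating before the join keeps one row per SK_ID_CURR, which is what a model trained on the application table would presumably need.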

Train Test

In [ ]:
default_dir = (
    "/Users/amandinelecerfdefer/Desktop/Formation_Data_Scientist_OC/WORK-projet7/Data"
)


# Google Colab variant (disabled):
# default_dir = "/content/drive/MyDrive/Data_projet_OC/"


app_train = pd.read_csv(os.path.join(default_dir, "application_train.csv"))
app_test = pd.read_csv(os.path.join(default_dir, "application_test.csv"))

print(f"Training Data Shape: {app_train.shape}")
print(f"Testing Data Shape: {app_test.shape}")

app_train.head()
Training Data Shape: (307511, 122)
Testing Data Shape: (48744, 121)
Out[ ]:
SK_ID_CURR TARGET NAME_CONTRACT_TYPE CODE_GENDER FLAG_OWN_CAR FLAG_OWN_REALTY CNT_CHILDREN AMT_INCOME_TOTAL AMT_CREDIT AMT_ANNUITY AMT_GOODS_PRICE NAME_TYPE_SUITE NAME_INCOME_TYPE NAME_EDUCATION_TYPE NAME_FAMILY_STATUS NAME_HOUSING_TYPE REGION_POPULATION_RELATIVE DAYS_BIRTH DAYS_EMPLOYED DAYS_REGISTRATION DAYS_ID_PUBLISH OWN_CAR_AGE FLAG_MOBIL FLAG_EMP_PHONE FLAG_WORK_PHONE FLAG_CONT_MOBILE FLAG_PHONE FLAG_EMAIL OCCUPATION_TYPE CNT_FAM_MEMBERS REGION_RATING_CLIENT REGION_RATING_CLIENT_W_CITY WEEKDAY_APPR_PROCESS_START HOUR_APPR_PROCESS_START REG_REGION_NOT_LIVE_REGION REG_REGION_NOT_WORK_REGION LIVE_REGION_NOT_WORK_REGION REG_CITY_NOT_LIVE_CITY REG_CITY_NOT_WORK_CITY LIVE_CITY_NOT_WORK_CITY ORGANIZATION_TYPE EXT_SOURCE_1 EXT_SOURCE_2 EXT_SOURCE_3 APARTMENTS_AVG BASEMENTAREA_AVG YEARS_BEGINEXPLUATATION_AVG YEARS_BUILD_AVG COMMONAREA_AVG ELEVATORS_AVG ENTRANCES_AVG FLOORSMAX_AVG FLOORSMIN_AVG LANDAREA_AVG LIVINGAPARTMENTS_AVG LIVINGAREA_AVG NONLIVINGAPARTMENTS_AVG NONLIVINGAREA_AVG APARTMENTS_MODE BASEMENTAREA_MODE YEARS_BEGINEXPLUATATION_MODE YEARS_BUILD_MODE COMMONAREA_MODE ELEVATORS_MODE ENTRANCES_MODE FLOORSMAX_MODE FLOORSMIN_MODE LANDAREA_MODE LIVINGAPARTMENTS_MODE LIVINGAREA_MODE NONLIVINGAPARTMENTS_MODE NONLIVINGAREA_MODE APARTMENTS_MEDI BASEMENTAREA_MEDI YEARS_BEGINEXPLUATATION_MEDI YEARS_BUILD_MEDI COMMONAREA_MEDI ELEVATORS_MEDI ENTRANCES_MEDI FLOORSMAX_MEDI FLOORSMIN_MEDI LANDAREA_MEDI LIVINGAPARTMENTS_MEDI LIVINGAREA_MEDI NONLIVINGAPARTMENTS_MEDI NONLIVINGAREA_MEDI FONDKAPREMONT_MODE HOUSETYPE_MODE TOTALAREA_MODE WALLSMATERIAL_MODE EMERGENCYSTATE_MODE OBS_30_CNT_SOCIAL_CIRCLE DEF_30_CNT_SOCIAL_CIRCLE OBS_60_CNT_SOCIAL_CIRCLE DEF_60_CNT_SOCIAL_CIRCLE DAYS_LAST_PHONE_CHANGE FLAG_DOCUMENT_2 FLAG_DOCUMENT_3 FLAG_DOCUMENT_4 FLAG_DOCUMENT_5 FLAG_DOCUMENT_6 FLAG_DOCUMENT_7 FLAG_DOCUMENT_8 FLAG_DOCUMENT_9 FLAG_DOCUMENT_10 FLAG_DOCUMENT_11 FLAG_DOCUMENT_12 FLAG_DOCUMENT_13 FLAG_DOCUMENT_14 FLAG_DOCUMENT_15 FLAG_DOCUMENT_16 FLAG_DOCUMENT_17 FLAG_DOCUMENT_18 
FLAG_DOCUMENT_19 FLAG_DOCUMENT_20 FLAG_DOCUMENT_21 AMT_REQ_CREDIT_BUREAU_HOUR AMT_REQ_CREDIT_BUREAU_DAY AMT_REQ_CREDIT_BUREAU_WEEK AMT_REQ_CREDIT_BUREAU_MON AMT_REQ_CREDIT_BUREAU_QRT AMT_REQ_CREDIT_BUREAU_YEAR
0 100002 1 Cash loans M N Y 0 202500.0 406597.5 24700.5 351000.0 Unaccompanied Working Secondary / secondary special Single / not married House / apartment 0.018801 -9461 -637 -3648.0 -2120 NaN 1 1 0 1 1 0 Laborers 1.0 2 2 WEDNESDAY 10 0 0 0 0 0 0 Business Entity Type 3 0.083037 0.262949 0.139376 0.0247 0.0369 0.9722 0.6192 0.0143 0.00 0.0690 0.0833 0.1250 0.0369 0.0202 0.0190 0.0000 0.0000 0.0252 0.0383 0.9722 0.6341 0.0144 0.0000 0.0690 0.0833 0.1250 0.0377 0.022 0.0198 0.0 0.0 0.0250 0.0369 0.9722 0.6243 0.0144 0.00 0.0690 0.0833 0.1250 0.0375 0.0205 0.0193 0.0000 0.00 reg oper account block of flats 0.0149 Stone, brick No 2.0 2.0 2.0 2.0 -1134.0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0.0 0.0 0.0 0.0 0.0 1.0
1 100003 0 Cash loans F N N 0 270000.0 1293502.5 35698.5 1129500.0 Family State servant Higher education Married House / apartment 0.003541 -16765 -1188 -1186.0 -291 NaN 1 1 0 1 1 0 Core staff 2.0 1 1 MONDAY 11 0 0 0 0 0 0 School 0.311267 0.622246 NaN 0.0959 0.0529 0.9851 0.7960 0.0605 0.08 0.0345 0.2917 0.3333 0.0130 0.0773 0.0549 0.0039 0.0098 0.0924 0.0538 0.9851 0.8040 0.0497 0.0806 0.0345 0.2917 0.3333 0.0128 0.079 0.0554 0.0 0.0 0.0968 0.0529 0.9851 0.7987 0.0608 0.08 0.0345 0.2917 0.3333 0.0132 0.0787 0.0558 0.0039 0.01 reg oper account block of flats 0.0714 Block No 1.0 0.0 1.0 0.0 -828.0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0.0 0.0 0.0 0.0 0.0 0.0
2 100004 0 Revolving loans M Y Y 0 67500.0 135000.0 6750.0 135000.0 Unaccompanied Working Secondary / secondary special Single / not married House / apartment 0.010032 -19046 -225 -4260.0 -2531 26.0 1 1 1 1 1 0 Laborers 1.0 2 2 MONDAY 9 0 0 0 0 0 0 Government NaN 0.555912 0.729567 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 0.0 0.0 0.0 0.0 -815.0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0.0 0.0 0.0 0.0 0.0 0.0
3 100006 0 Cash loans F N Y 0 135000.0 312682.5 29686.5 297000.0 Unaccompanied Working Secondary / secondary special Civil marriage House / apartment 0.008019 -19005 -3039 -9833.0 -2437 NaN 1 1 0 1 0 0 Laborers 2.0 2 2 WEDNESDAY 17 0 0 0 0 0 0 Business Entity Type 3 NaN 0.650442 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 2.0 0.0 2.0 0.0 -617.0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 NaN NaN NaN NaN NaN NaN
4 100007 0 Cash loans M N Y 0 121500.0 513000.0 21865.5 513000.0 Unaccompanied Working Secondary / secondary special Single / not married House / apartment 0.028663 -19932 -3038 -4311.0 -3458 NaN 1 1 0 1 0 0 Core staff 1.0 2 2 THURSDAY 11 0 0 0 0 1 1 Religion NaN 0.322738 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 0.0 0.0 0.0 0.0 -1106.0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0.0 0.0 0.0 0.0 0.0 0.0

The train CSV contains the target; the test CSV does not.

In [ ]:
def get_balance_data():
    pos_dtype = {
        "SK_ID_PREV": np.uint64, "SK_ID_CURR": np.uint64, "MONTHS_BALANCE": np.int64, "SK_DPD": np.int64, 
        "SK_DPD_DEF": np.int64, "CNT_INSTALMENT": np.float64, "CNT_INSTALMENT_FUTURE": np.float64,
    }

    install_dtype = {
        "SK_ID_PREV": np.uint64, "SK_ID_CURR": np.uint64, "NUM_INSTALMENT_NUMBER": np.int64, "NUM_INSTALMENT_VERSION": np.float64,
        "DAYS_INSTALMENT": np.float64, "DAYS_ENTRY_PAYMENT": np.float64, "AMT_INSTALMENT": np.float64, "AMT_PAYMENT": np.float64
    }

    card_dtype = {
        "SK_ID_PREV": np.uint64, "SK_ID_CURR": np.uint64, "MONTHS_BALANCE": np.int64, "AMT_CREDIT_LIMIT_ACTUAL": np.int64,
        "CNT_DRAWINGS_CURRENT": np.int64, "SK_DPD": np.int64, "SK_DPD_DEF": np.int64, "AMT_BALANCE": np.float64,
        "AMT_DRAWINGS_ATM_CURRENT": np.float64, "AMT_DRAWINGS_CURRENT": np.float64, "AMT_DRAWINGS_OTHER_CURRENT": np.float64,
        "AMT_DRAWINGS_POS_CURRENT": np.float64, "AMT_INST_MIN_REGULARITY": np.float64, "AMT_PAYMENT_CURRENT": np.float64,
        "AMT_PAYMENT_TOTAL_CURRENT": np.float64, "AMT_RECEIVABLE_PRINCIPAL": np.float64, "AMT_RECIVABLE": np.float64,
        "AMT_TOTAL_RECEIVABLE": np.float64, "CNT_DRAWINGS_ATM_CURRENT": np.float64, "CNT_DRAWINGS_OTHER_CURRENT": np.float64,
        "CNT_DRAWINGS_POS_CURRENT": np.float64,  "CNT_INSTALMENT_MATURE_CUM": np.float64
    }

    bureau_dtype = {
        "SK_ID_BUREAU": np.uint64, "SK_ID_CURR": np.uint64, "DAYS_CREDIT": np.int64, "CREDIT_DAY_OVERDUE": np.int64,
        "DAYS_CREDIT_ENDDATE": np.float64, "DAYS_ENDDATE_FACT": np.float64, "AMT_CREDIT_MAX_OVERDUE": np.float64,
        "CNT_CREDIT_PROLONG": np.int64, "AMT_CREDIT_SUM": np.float64, "AMT_CREDIT_SUM_DEBT": np.float64,
        "AMT_CREDIT_SUM_LIMIT": np.float64, "AMT_CREDIT_SUM_OVERDUE": np.float64, "DAYS_CREDIT_UPDATE": np.int64,
        "AMT_ANNUITY": np.float64
    }

    previous_application_dtype = {
        "SK_ID_PREV": np.uint64, "SK_ID_CURR": np.uint64, "AMT_ANNUITY": np.float64, "AMT_APPLICATION": np.float64,
        "AMT_CREDIT": np.float64, "AMT_DOWN_PAYMENT": np.float64, "AMT_GOODS_PRICE": np.float64, "HOUR_APPR_PROCESS_START": np.float64,
        "NFLAG_LAST_APPL_IN_DAY": np.float64, "RATE_DOWN_PAYMENT": np.float64, "RATE_INTEREST_PRIMARY": np.float64,
        "RATE_INTEREST_PRIVILEGED": np.float64, "DAYS_DECISION": np.int64, "SELLERPLACE_AREA": np.int64,
        "CNT_PAYMENT": np.float64, "DAYS_FIRST_DRAWING": np.float64, "DAYS_FIRST_DUE": np.float64, "DAYS_LAST_DUE_1ST_VERSION": np.float64,
        "DAYS_LAST_DUE": np.float64, "DAYS_TERMINATION": np.float64, "NFLAG_INSURED_ON_APPROVAL": np.float64
    }
    
    bureau_balance_dtype = {
        "SK_ID_BUREAU": np.uint64, "MONTHS_BALANCE": np.int64
    }

    POS_CASH_balance = pd.read_csv(os.path.join(default_dir, "POS_CASH_balance.csv"), dtype=pos_dtype)

    installments_payments = pd.read_csv(os.path.join(default_dir, "installments_payments.csv"), dtype=install_dtype)

    credit_card_balance = pd.read_csv(os.path.join(default_dir, "credit_card_balance.csv"), dtype=card_dtype)

    bureau = pd.read_csv(os.path.join(default_dir, "bureau.csv"), dtype=bureau_dtype)

    previous_application = pd.read_csv(os.path.join(default_dir, "previous_application.csv"), dtype=previous_application_dtype)
    
    bureau_balance = pd.read_csv(os.path.join(default_dir, "bureau_balance.csv"), dtype=bureau_balance_dtype)
    
    return POS_CASH_balance, installments_payments, credit_card_balance, bureau, previous_application, bureau_balance
In [ ]:
POS_CASH_balance, installments_payments, credit_card_balance, bureau, previous_application, bureau_balance = get_balance_data()
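get_balance_data passes an explicit dtype mapping to read_csv for each file. The sketch below shows the effect on a small in-memory CSV; the CSV content is made up, and only three column names are borrowed from the POS_CASH_balance mapping above.

```python
import io

import numpy as np
import pandas as pd

# A tiny CSV with one missing value in CNT_INSTALMENT (made-up content).
csv_text = "SK_ID_PREV,MONTHS_BALANCE,CNT_INSTALMENT\n1,-1,12\n2,-2,\n"

# Without dtype, pandas infers the types itself.
default_df = pd.read_csv(io.StringIO(csv_text))

# With dtype, each listed column is read directly as the requested type.
typed_df = pd.read_csv(
    io.StringIO(csv_text),
    dtype={
        "SK_ID_PREV": np.uint64,
        "MONTHS_BALANCE": np.int64,
        "CNT_INSTALMENT": np.float64,  # float, because the column has a missing value
    },
)

print(default_df.dtypes)
print(typed_df.dtypes)
```

Count-like columns such as CNT_INSTALMENT are declared as float64 in the mappings above, presumably because NumPy integer dtypes cannot hold missing values.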
In [ ]:
sample_submission = pd.read_csv(os.path.join(default_dir, "sample_submission.csv"))
pd.set_option("display.max_colwidth", 400)
HomeCredit_columns_description = pd.read_csv(os.path.join(default_dir, "HomeCredit_columns_description.csv"), encoding='mac_roman')
HomeCredit_columns_description
Out[ ]:
Unnamed: 0 Table Row Description Special
0 1 application_{train|test}.csv SK_ID_CURR ID of loan in our sample NaN
1 2 application_{train|test}.csv TARGET Target variable (1 - client with payment difficulties: he/she had late payment more than X days on at least one of the first Y installments of the loan in our sample, 0 - all other cases) NaN
2 5 application_{train|test}.csv NAME_CONTRACT_TYPE Identification if loan is cash or revolving NaN
3 6 application_{train|test}.csv CODE_GENDER Gender of the client NaN
4 7 application_{train|test}.csv FLAG_OWN_CAR Flag if the client owns a car NaN
... ... ... ... ... ...
214 217 installments_payments.csv NUM_INSTALMENT_NUMBER On which installment we observe payment NaN
215 218 installments_payments.csv DAYS_INSTALMENT When the installment of previous credit was supposed to be paid (relative to application date of current loan) time only relative to the application
216 219 installments_payments.csv DAYS_ENTRY_PAYMENT When was the installments of previous credit paid actually (relative to application date of current loan) time only relative to the application
217 220 installments_payments.csv AMT_INSTALMENT What was the prescribed installment amount of previous credit on this installment NaN
218 221 installments_payments.csv AMT_PAYMENT What the client actually paid on previous credit on this installment NaN

219 rows × 5 columns
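Since the description table has one row per column of every file, it can serve as a lookup. A minimal sketch, assuming a helper named describe_column (hypothetical, not part of this notebook) and a toy frame holding two rows copied from the table above:

```python
import pandas as pd

# Two rows copied from the description table shown above.
desc = pd.DataFrame({
    "Table": ["application_{train|test}.csv", "application_{train|test}.csv"],
    "Row": ["SK_ID_CURR", "FLAG_OWN_CAR"],
    "Description": ["ID of loan in our sample", "Flag if the client owns a car"],
})


def describe_column(description_df, column_name):  # hypothetical helper
    """Return the table name and description recorded for column_name."""
    mask = description_df["Row"] == column_name
    return description_df.loc[mask, ["Table", "Description"]]


result = describe_column(desc, "FLAG_OWN_CAR")
print(result)
```

With the real frame, the same call would be describe_column(HomeCredit_columns_description, "FLAG_OWN_CAR").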

In [ ]:
def data_describe(folder):
    '''Summarise each CSV file in folder (a list of file paths).
       Reports rows, columns, % of missing cells, % of duplicated rows,
       the number of columns per dtype and the memory footprint in MB.'''

    data_dict = {}
    for file in folder:
        data = pd.read_csv(file, encoding='mac_roman')
        data_dict[file] = [data.shape[0], 
                           data.shape[1],
                            round(data.isna().sum().sum()/data.size*100, 2),
                            # duplicated() flags rows, so divide by the row count
                            round(data.duplicated().sum()/len(data)*100, 2),
                            data.select_dtypes(include=['object']).shape[1],
                            data.select_dtypes(include=['float']).shape[1],
                            data.select_dtypes(include=['int']).shape[1],
                            data.select_dtypes(include=['bool']).shape[1],
                            round(data.memory_usage().sum()/1024**2, 3)]

    # Build the summary table once, after every file has been read
    comparative_table = pd.DataFrame.from_dict(data = data_dict, 
                                               columns = ['Rows', 'Columns', '%NaN', '%Duplicate', 
                                                          'object_dtype','float_dtype', 'int_dtype', 
                                                          'bool_dtype', 'MB_Memory'], 
                                               orient='index')
    print("SUMMARY FILES…")
    return comparative_table
In [ ]:
# Data description (Google Colab variant, disabled)
# data_describe(glob.glob('/content/drive/MyDrive/Data_projet_OC/*.csv'))
#
# glob finds every CSV file matching the *.csv pattern
In [ ]:
#Data description
data_describe(glob.glob('/Users/amandinelecerfdefer/Desktop/Formation_Data_Scientist_OC/WORK-projet7/Data/*.csv'))

# glob finds every CSV file matching the *.csv pattern
SUMMARY FILES…
Out[ ]:
Rows Columns %NaN %Duplicate object_dtype float_dtype int_dtype bool_dtype MB_Memory
/Users/amandinelecerfdefer/Desktop/Formation_Data_Scientist_OC/WORK-projet7/Data/application_test.csv 48744 121 23.81 0.0 16 65 40 0 44.998
/Users/amandinelecerfdefer/Desktop/Formation_Data_Scientist_OC/WORK-projet7/Data/HomeCredit_columns_description.csv 219 5 12.15 0.0 4 0 1 0 0.008
/Users/amandinelecerfdefer/Desktop/Formation_Data_Scientist_OC/WORK-projet7/Data/POS_CASH_balance.csv 10001358 8 0.07 0.0 1 2 5 0 610.435
/Users/amandinelecerfdefer/Desktop/Formation_Data_Scientist_OC/WORK-projet7/Data/credit_card_balance.csv 3840312 23 6.65 0.0 1 15 7 0 673.883
/Users/amandinelecerfdefer/Desktop/Formation_Data_Scientist_OC/WORK-projet7/Data/installments_payments.csv 13605401 8 0.01 0.0 0 5 3 0 830.408
/Users/amandinelecerfdefer/Desktop/Formation_Data_Scientist_OC/WORK-projet7/Data/application_train.csv 307511 122 24.40 0.0 16 65 41 0 286.227
/Users/amandinelecerfdefer/Desktop/Formation_Data_Scientist_OC/WORK-projet7/Data/bureau.csv 1716428 17 13.50 0.0 3 8 6 0 222.620
/Users/amandinelecerfdefer/Desktop/Formation_Data_Scientist_OC/WORK-projet7/Data/previous_application.csv 1670214 37 17.98 0.0 16 15 6 0 471.481
/Users/amandinelecerfdefer/Desktop/Formation_Data_Scientist_OC/WORK-projet7/Data/bureau_balance.csv 27299925 3 0.00 0.0 1 0 2 0 624.846
/Users/amandinelecerfdefer/Desktop/Formation_Data_Scientist_OC/WORK-projet7/Data/sample_submission.csv 48744 2 0.00 0.0 0 1 1 0 0.744
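The %NaN and %Duplicate columns can be checked by hand on a toy frame. This is a sketch with made-up values; expressing %Duplicate as a share of rows is an assumption about what that column is meant to report.

```python
import numpy as np
import pandas as pd

# 4 rows x 2 columns = 8 cells, 2 of them missing; the last row repeats the third.
toy = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, 3.0],
    "b": ["x", None, "z", "z"],
})

# Share of missing cells among all cells.
pct_nan = round(toy.isna().sum().sum() / toy.size * 100, 2)

# Share of rows that repeat an earlier row (assumption: per-row percentage).
pct_dup_rows = round(toy.duplicated().sum() / len(toy) * 100, 2)

print(pct_nan, pct_dup_rows)
```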
In [ ]:
def features(folder):
    '''Comparative data with missing values, 
       and many descriptive statistics.'''
    
    data_object = {}
    data_numeric = {}
    
    for file in folder:
        data = pd.read_csv(file, encoding='mac_roman')
        
        data_object[file] = [(x, data[x].dtype, 
                              data[x].isna().sum().sum(),
                              int(data[x].count())) for x in data.select_dtypes(exclude=['int', 'float'])]
        
        data_numeric[file] = [(x, data[x].dtype, 
                               int(data[x].isna().sum().sum()), 
                               int(data[x].count()), 
                               int(data[x].mean()), 
                               round(data[x].std(),1),
                               round(data[x].min(),1), 
                               round(data[x].max(),1)) for x in data.select_dtypes(exclude='object')]
        

    # One block per file, stacked with the file path as the outer index level
    dict_of_object = {name: pd.DataFrame(file) for name,file in data_object.items()}
    df1 = pd.concat(dict_of_object, axis=0)
    df1.columns=['features','dtype','nan','count']
    
    dict_of_numeric = {name: pd.DataFrame(file) for name,file in data_numeric.items()}
    df2 = pd.concat(dict_of_numeric, axis=0)
    df2.columns=['features','dtype','nan','count', 'mean', 'std', 'min','max']
        
    return df1, df2
In [ ]:
# Data description (Google Colab variant, disabled)
# features(glob.glob('/content/drive/MyDrive/Data_projet_OC/*.csv'))[0]
#
# glob finds every CSV file matching the *.csv pattern
In [ ]:
#Data description
features(glob.glob('/Users/amandinelecerfdefer/Desktop/Formation_Data_Scientist_OC/WORK-projet7/Data/*.csv'))[0]
Out[ ]:
features dtype nan count
/Users/amandinelecerfdefer/Desktop/Formation_Data_Scientist_OC/WORK-projet7/Data/application_test.csv 0 NAME_CONTRACT_TYPE object 0.0 48744.0
1 CODE_GENDER object 0.0 48744.0
2 FLAG_OWN_CAR object 0.0 48744.0
3 FLAG_OWN_REALTY object 0.0 48744.0
4 NAME_TYPE_SUITE object 911.0 47833.0
5 NAME_INCOME_TYPE object 0.0 48744.0
6 NAME_EDUCATION_TYPE object 0.0 48744.0
7 NAME_FAMILY_STATUS object 0.0 48744.0
8 NAME_HOUSING_TYPE object 0.0 48744.0
9 OCCUPATION_TYPE object 15605.0 33139.0
10 WEEKDAY_APPR_PROCESS_START object 0.0 48744.0
11 ORGANIZATION_TYPE object 0.0 48744.0
12 FONDKAPREMONT_MODE object 32797.0 15947.0
13 HOUSETYPE_MODE object 23619.0 25125.0
14 WALLSMATERIAL_MODE object 23893.0 24851.0
15 EMERGENCYSTATE_MODE object 22209.0 26535.0
/Users/amandinelecerfdefer/Desktop/Formation_Data_Scientist_OC/WORK-projet7/Data/HomeCredit_columns_description.csv 0 Table object 0.0 219.0
1 Row object 0.0 219.0
2 Description object 0.0 219.0
3 Special object 133.0 86.0
/Users/amandinelecerfdefer/Desktop/Formation_Data_Scientist_OC/WORK-projet7/Data/POS_CASH_balance.csv 0 NAME_CONTRACT_STATUS object 0.0 10001358.0
/Users/amandinelecerfdefer/Desktop/Formation_Data_Scientist_OC/WORK-projet7/Data/credit_card_balance.csv 0 NAME_CONTRACT_STATUS object 0.0 3840312.0
/Users/amandinelecerfdefer/Desktop/Formation_Data_Scientist_OC/WORK-projet7/Data/application_train.csv 0 NAME_CONTRACT_TYPE object 0.0 307511.0
1 CODE_GENDER object 0.0 307511.0
2 FLAG_OWN_CAR object 0.0 307511.0
3 FLAG_OWN_REALTY object 0.0 307511.0
4 NAME_TYPE_SUITE object 1292.0 306219.0
5 NAME_INCOME_TYPE object 0.0 307511.0
6 NAME_EDUCATION_TYPE object 0.0 307511.0
7 NAME_FAMILY_STATUS object 0.0 307511.0
8 NAME_HOUSING_TYPE object 0.0 307511.0
9 OCCUPATION_TYPE object 96391.0 211120.0
10 WEEKDAY_APPR_PROCESS_START object 0.0 307511.0
11 ORGANIZATION_TYPE object 0.0 307511.0
12 FONDKAPREMONT_MODE object 210295.0 97216.0
13 HOUSETYPE_MODE object 154297.0 153214.0
14 WALLSMATERIAL_MODE object 156341.0 151170.0
15 EMERGENCYSTATE_MODE object 145755.0 161756.0
/Users/amandinelecerfdefer/Desktop/Formation_Data_Scientist_OC/WORK-projet7/Data/bureau.csv 0 CREDIT_ACTIVE object 0.0 1716428.0
1 CREDIT_CURRENCY object 0.0 1716428.0
2 CREDIT_TYPE object 0.0 1716428.0
/Users/amandinelecerfdefer/Desktop/Formation_Data_Scientist_OC/WORK-projet7/Data/previous_application.csv 0 NAME_CONTRACT_TYPE object 0.0 1670214.0
1 WEEKDAY_APPR_PROCESS_START object 0.0 1670214.0
2 FLAG_LAST_APPL_PER_CONTRACT object 0.0 1670214.0
3 NAME_CASH_LOAN_PURPOSE object 0.0 1670214.0
4 NAME_CONTRACT_STATUS object 0.0 1670214.0
5 NAME_PAYMENT_TYPE object 0.0 1670214.0
6 CODE_REJECT_REASON object 0.0 1670214.0
7 NAME_TYPE_SUITE object 820405.0 849809.0
8 NAME_CLIENT_TYPE object 0.0 1670214.0
9 NAME_GOODS_CATEGORY object 0.0 1670214.0
10 NAME_PORTFOLIO object 0.0 1670214.0
11 NAME_PRODUCT_TYPE object 0.0 1670214.0
12 CHANNEL_TYPE object 0.0 1670214.0
13 NAME_SELLER_INDUSTRY object 0.0 1670214.0
14 NAME_YIELD_GROUP object 0.0 1670214.0
15 PRODUCT_COMBINATION object 346.0 1669868.0
/Users/amandinelecerfdefer/Desktop/Formation_Data_Scientist_OC/WORK-projet7/Data/bureau_balance.csv 0 STATUS object 0.0 27299925.0
In [ ]:
# Data description (Google Colab variant, disabled)
# features(glob.glob('/content/drive/MyDrive/Data_projet_OC/*.csv'))[1]
#
# glob finds every CSV file matching the *.csv pattern
In [ ]:
#Data description
features(folder=glob.glob('/Users/amandinelecerfdefer/Desktop/Formation_Data_Scientist_OC/WORK-projet7/Data/*.csv'))[1]
Out[ ]:
features dtype nan count mean std min max
/Users/amandinelecerfdefer/Desktop/Formation_Data_Scientist_OC/WORK-projet7/Data/application_test.csv 0 SK_ID_CURR int64 0 48744 277796 103169.5 100001.0 456250.0
1 CNT_CHILDREN int64 0 48744 0 0.7 0.0 20.0
2 AMT_INCOME_TOTAL float64 0 48744 178431 101522.6 26941.5 4410000.0
3 AMT_CREDIT float64 0 48744 516740 365397.0 45000.0 2245500.0
4 AMT_ANNUITY float64 24 48720 29426 16016.4 2295.0 180576.0
... ... ... ... ... ... ... ... ... ...
/Users/amandinelecerfdefer/Desktop/Formation_Data_Scientist_OC/WORK-projet7/Data/previous_application.csv 20 NFLAG_INSURED_ON_APPROVAL float64 673065 997149 0 0.5 0.0 1.0
/Users/amandinelecerfdefer/Desktop/Formation_Data_Scientist_OC/WORK-projet7/Data/bureau_balance.csv 0 SK_ID_BUREAU int64 0 27299925 6036297 492348.9 5001709.0 6842888.0
1 MONTHS_BALANCE int64 0 27299925 -30 23.9 -96.0 0.0
/Users/amandinelecerfdefer/Desktop/Formation_Data_Scientist_OC/WORK-projet7/Data/sample_submission.csv 0 SK_ID_CURR int64 0 48744 277796 103169.5 100001.0 456250.0
1 TARGET float64 0 48744 0 0.0 0.5 0.5

288 rows × 8 columns
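One detail helps when reading the mean column: features casts each mean with int(), which truncates toward zero. That is why the TARGET column of sample_submission.csv, whose min and max are both 0.5, shows a mean of 0. A small sketch on made-up series:

```python
import pandas as pd

# A constant column, like TARGET in sample_submission.csv (min = max = 0.5).
target = pd.Series([0.5, 0.5, 0.5])

truncated_mean = int(target.mean())      # what the table above reports
rounded_mean = round(target.mean(), 1)   # alternative that keeps one decimal

# Truncation goes toward zero, so negative means move up, not down.
months = pd.Series([-30.0, -31.0, -31.0])
truncated_negative = int(months.mean())

print(truncated_mean, rounded_mean, truncated_negative)
```

For columns on a 0 to 1 scale, rounding rather than truncating would keep the mean informative.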

Exploratory Data Analysis (EDA): Train and Test CSV

This is the main table, split into two files: Train (with TARGET) and Test (without TARGET). Static data for all applications. One row represents one loan in our data sample.

First observations

In [ ]:
def informations(dataframe):
    """Print general information about a dataset:
    its number of rows and columns, the column names and the column dtypes.
    dataframe : dataset"""
    print(colored("\n Overview of the dataset : \n", 'red'))
    lines = dataframe.shape[0]
    columns = dataframe.shape[1]
    print(colored("The dataset has {} rows and {} "
                  "columns. \n \n".format(lines, columns), 'blue'))
    print(colored("Column's name : \n", 'green'))
    print(list(dataframe.columns))
    print("\n")
    print(colored("Column's Type : \n", 'green'))
    print(list(dataframe.dtypes))
    print("\n")
In [ ]:
informations(app_train)

 Overview of the dataset : 

The dataset has 307511 rows and 122 columns. 
 

Column's name : 

['SK_ID_CURR', 'TARGET', 'NAME_CONTRACT_TYPE', 'CODE_GENDER', 'FLAG_OWN_CAR', 'FLAG_OWN_REALTY', 'CNT_CHILDREN', 'AMT_INCOME_TOTAL', 'AMT_CREDIT', 'AMT_ANNUITY', 'AMT_GOODS_PRICE', 'NAME_TYPE_SUITE', 'NAME_INCOME_TYPE', 'NAME_EDUCATION_TYPE', 'NAME_FAMILY_STATUS', 'NAME_HOUSING_TYPE', 'REGION_POPULATION_RELATIVE', 'DAYS_BIRTH', 'DAYS_EMPLOYED', 'DAYS_REGISTRATION', 'DAYS_ID_PUBLISH', 'OWN_CAR_AGE', 'FLAG_MOBIL', 'FLAG_EMP_PHONE', 'FLAG_WORK_PHONE', 'FLAG_CONT_MOBILE', 'FLAG_PHONE', 'FLAG_EMAIL', 'OCCUPATION_TYPE', 'CNT_FAM_MEMBERS', 'REGION_RATING_CLIENT', 'REGION_RATING_CLIENT_W_CITY', 'WEEKDAY_APPR_PROCESS_START', 'HOUR_APPR_PROCESS_START', 'REG_REGION_NOT_LIVE_REGION', 'REG_REGION_NOT_WORK_REGION', 'LIVE_REGION_NOT_WORK_REGION', 'REG_CITY_NOT_LIVE_CITY', 'REG_CITY_NOT_WORK_CITY', 'LIVE_CITY_NOT_WORK_CITY', 'ORGANIZATION_TYPE', 'EXT_SOURCE_1', 'EXT_SOURCE_2', 'EXT_SOURCE_3', 'APARTMENTS_AVG', 'BASEMENTAREA_AVG', 'YEARS_BEGINEXPLUATATION_AVG', 'YEARS_BUILD_AVG', 'COMMONAREA_AVG', 'ELEVATORS_AVG', 'ENTRANCES_AVG', 'FLOORSMAX_AVG', 'FLOORSMIN_AVG', 'LANDAREA_AVG', 'LIVINGAPARTMENTS_AVG', 'LIVINGAREA_AVG', 'NONLIVINGAPARTMENTS_AVG', 'NONLIVINGAREA_AVG', 'APARTMENTS_MODE', 'BASEMENTAREA_MODE', 'YEARS_BEGINEXPLUATATION_MODE', 'YEARS_BUILD_MODE', 'COMMONAREA_MODE', 'ELEVATORS_MODE', 'ENTRANCES_MODE', 'FLOORSMAX_MODE', 'FLOORSMIN_MODE', 'LANDAREA_MODE', 'LIVINGAPARTMENTS_MODE', 'LIVINGAREA_MODE', 'NONLIVINGAPARTMENTS_MODE', 'NONLIVINGAREA_MODE', 'APARTMENTS_MEDI', 'BASEMENTAREA_MEDI', 'YEARS_BEGINEXPLUATATION_MEDI', 'YEARS_BUILD_MEDI', 'COMMONAREA_MEDI', 'ELEVATORS_MEDI', 'ENTRANCES_MEDI', 'FLOORSMAX_MEDI', 'FLOORSMIN_MEDI', 'LANDAREA_MEDI', 'LIVINGAPARTMENTS_MEDI', 'LIVINGAREA_MEDI', 'NONLIVINGAPARTMENTS_MEDI', 'NONLIVINGAREA_MEDI', 'FONDKAPREMONT_MODE', 'HOUSETYPE_MODE', 'TOTALAREA_MODE', 'WALLSMATERIAL_MODE', 'EMERGENCYSTATE_MODE', 'OBS_30_CNT_SOCIAL_CIRCLE', 'DEF_30_CNT_SOCIAL_CIRCLE', 'OBS_60_CNT_SOCIAL_CIRCLE', 'DEF_60_CNT_SOCIAL_CIRCLE', 'DAYS_LAST_PHONE_CHANGE', 
'FLAG_DOCUMENT_2', 'FLAG_DOCUMENT_3', 'FLAG_DOCUMENT_4', 'FLAG_DOCUMENT_5', 'FLAG_DOCUMENT_6', 'FLAG_DOCUMENT_7', 'FLAG_DOCUMENT_8', 'FLAG_DOCUMENT_9', 'FLAG_DOCUMENT_10', 'FLAG_DOCUMENT_11', 'FLAG_DOCUMENT_12', 'FLAG_DOCUMENT_13', 'FLAG_DOCUMENT_14', 'FLAG_DOCUMENT_15', 'FLAG_DOCUMENT_16', 'FLAG_DOCUMENT_17', 'FLAG_DOCUMENT_18', 'FLAG_DOCUMENT_19', 'FLAG_DOCUMENT_20', 'FLAG_DOCUMENT_21', 'AMT_REQ_CREDIT_BUREAU_HOUR', 'AMT_REQ_CREDIT_BUREAU_DAY', 'AMT_REQ_CREDIT_BUREAU_WEEK', 'AMT_REQ_CREDIT_BUREAU_MON', 'AMT_REQ_CREDIT_BUREAU_QRT', 'AMT_REQ_CREDIT_BUREAU_YEAR']


Column's Type : 

[dtype('int64'), dtype('int64'), dtype('O'), dtype('O'), dtype('O'), dtype('O'), dtype('int64'), dtype('float64'), dtype('float64'), dtype('float64'), dtype('float64'), dtype('O'), dtype('O'), dtype('O'), dtype('O'), dtype('O'), dtype('float64'), dtype('int64'), dtype('int64'), dtype('float64'), dtype('int64'), dtype('float64'), dtype('int64'), dtype('int64'), dtype('int64'), dtype('int64'), dtype('int64'), dtype('int64'), dtype('O'), dtype('float64'), dtype('int64'), dtype('int64'), dtype('O'), dtype('int64'), dtype('int64'), dtype('int64'), dtype('int64'), dtype('int64'), dtype('int64'), dtype('int64'), dtype('O'), dtype('float64'), dtype('float64'), dtype('float64'), dtype('float64'), dtype('float64'), dtype('float64'), dtype('float64'), dtype('float64'), dtype('float64'), dtype('float64'), dtype('float64'), dtype('float64'), dtype('float64'), dtype('float64'), dtype('float64'), dtype('float64'), dtype('float64'), dtype('float64'), dtype('float64'), dtype('float64'), dtype('float64'), dtype('float64'), dtype('float64'), dtype('float64'), dtype('float64'), dtype('float64'), dtype('float64'), dtype('float64'), dtype('float64'), dtype('float64'), dtype('float64'), dtype('float64'), dtype('float64'), dtype('float64'), dtype('float64'), dtype('float64'), dtype('float64'), dtype('float64'), dtype('float64'), dtype('float64'), dtype('float64'), dtype('float64'), dtype('float64'), dtype('float64'), dtype('float64'), dtype('O'), dtype('O'), dtype('float64'), dtype('O'), dtype('O'), dtype('float64'), dtype('float64'), dtype('float64'), dtype('float64'), dtype('float64'), dtype('int64'), dtype('int64'), dtype('int64'), dtype('int64'), dtype('int64'), dtype('int64'), dtype('int64'), dtype('int64'), dtype('int64'), dtype('int64'), dtype('int64'), dtype('int64'), dtype('int64'), dtype('int64'), dtype('int64'), dtype('int64'), dtype('int64'), dtype('int64'), dtype('int64'), dtype('int64'), dtype('float64'), dtype('float64'), dtype('float64'), dtype('float64'), 
dtype('float64'), dtype('float64')]


The training set contains 307,511 observations (loans) and 122 columns.

In [ ]:
# Check that 'TARGET' is the only column that differs between the two files
print("Columns that differ between the two files:")
display(app_train.columns.difference(app_test.columns))
Columns that differ between the two files:
Index(['TARGET'], dtype='object')

TARGET is what we are asked to predict: 0 means the loan was repaid on time, and 1 means the client had payment difficulties. We can first look at the number of loans in each category.

In [ ]:
app_train.groupby(['TARGET'])['SK_ID_CURR'].count()
Out[ ]:
TARGET
0    282686
1     24825
Name: SK_ID_CURR, dtype: int64
In [ ]:
def graphe_col_category(dataframe, col, size, name):
    """Plot a categorical variable as a bar plot and a pie plot.
    dataframe : dataset
    col : name of the column to plot
    size : size of the figure (X,X)
    name : labels of the pie slices, in value_counts() order"""
    values = dataframe[col].value_counts()
    labels = values.index
    plt.figure(figsize=size)
    
    #bar plot
    plt.subplot(2, 2, 1)
    sns.barplot(x=labels, 
            y=values,
            palette='pink')
    
    # Pie Plot
    plt.subplot(2, 2, 2)
    plt.title("Representation of the variable {}" .format(
        col), fontsize=20)
    plt.pie(values, labels=name,
            autopct='%.1f%%', shadow=True, textprops={'fontsize': 20})
    plt.axis('equal')
    plt.tight_layout()
    plt.legend()
    plt.show()
In [ ]:
graphe_col_category(app_train, 'TARGET', (20,20), ['No failure', 'failure'])

The classes are imbalanced: there are far more individuals in class 0 than in class 1. https://ichi.pro/fr/apprentissage-desequilibre-gerer-un-probleme-de-classe-desequilibre-71099907199579
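To put a number on this imbalance, here is a minimal sketch based on the class counts printed above. The "balanced" weighting formula, n_samples / (n_classes * class_count), is a common heuristic and an assumption here, not a choice made in this notebook.

```python
# Class counts taken from the groupby output above.
counts = {0: 282686, 1: 24825}

n_samples = sum(counts.values())
n_classes = len(counts)

# Share of loans with payment difficulties (TARGET = 1).
default_rate = counts[1] / n_samples

# "Balanced" weights (assumed heuristic): rarer classes get larger weights.
class_weights = {label: n_samples / (n_classes * n) for label, n in counts.items()}

print(f"Default rate: {default_rate:.2%}")
print({label: round(w, 2) for label, w in class_weights.items()})
```

Weights of this kind could later be passed to a classifier that supports per-class weighting, so that errors on the minority class count for more.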

In [ ]:
# plt.subplots returns (figure, axes), in that order
fig, ax = plt.subplots(figsize=(20,8))
sns.countplot(y='TARGET', data=app_train, ax=ax)
ax.set_title("TARGET distribution")

# Annotate each bar with its share of the training set
for p in ax.patches:
    percentage = '{:.1f}%'.format(100 * p.get_width()/len(app_train.TARGET))
    x = p.get_x() + p.get_width()
    y = p.get_y() + p.get_height()/2
    ax.annotate(percentage, (x, y), fontsize=20, fontweight='bold')

plt.show()
In [ ]:
plt.figure(figsize=(20, 20))

plt.subplot(2, 2, 1)
sns.barplot(x=app_train['CODE_GENDER'].value_counts().index, 
            y=app_train['CODE_GENDER'].value_counts(), palette='pink').set_title("Gender distribution of individuals in the training set")

plt.subplot(2, 2, 2)
sns.barplot(x=app_test['CODE_GENDER'].value_counts().index, 
            y=app_test['CODE_GENDER'].value_counts(), palette='Blues').set_title("Gender distribution of individuals in the testing set")
Out[ ]:
Text(0.5, 1.0, 'Gender distribution of individuals in the testing set')

We can see a third category for gender (also present in other columns) that appears to correspond to missing data, encoded differently from one record to another.
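A quick way to see which columns use such a placeholder is to count it column by column. The sketch below runs on a small made-up frame (the `toy` data is hypothetical); only the 'XNA' marker comes from the dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; only the 'XNA' placeholder mirrors the real files.
toy = pd.DataFrame({
    "CODE_GENDER": ["F", "M", "XNA", "F"],
    "ORGANIZATION_TYPE": ["School", "XNA", "XNA", "Bank"],
    "AMT_CREDIT": [1000.0, 2000.0, 1500.0, 3000.0],
})

# Count the placeholder in every non-numeric column.
text_cols = toy.select_dtypes(exclude="number")
xna_counts = (text_cols == "XNA").sum()
print(xna_counts[xna_counts > 0])

# Replacing the placeholder turns it into a regular missing value.
cleaned = toy.replace("XNA", np.nan)
print(cleaned.isna().sum())
```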

In [ ]:
app_train = app_train.replace('XNA', np.nan)
app_test = app_test.replace('XNA', np.nan)
In [ ]:
plt.figure(figsize=(20, 20))

plt.subplot(2, 2, 1)
sns.barplot(x=app_train['CODE_GENDER'].value_counts().index, 
            y=app_train['CODE_GENDER'].value_counts(), palette='pink').set_title("Gender distribution of individuals in the training set")

plt.subplot(2, 2, 2)
sns.barplot(x=app_test['CODE_GENDER'].value_counts().index, 
            y=app_test['CODE_GENDER'].value_counts(), palette='Blues').set_title("Gender distribution of individuals in the testing set")
Out[ ]:
Text(0.5, 1.0, 'Gender distribution of individuals in the testing set')
In [ ]:
app_train['CODE_GENDER'].describe()
Out[ ]:
count     307507
unique         2
top            F
freq      202448
Name: CODE_GENDER, dtype: object

Types of columns

In [ ]:
# number of columns of each type.
app_train.dtypes.value_counts()
Out[ ]:
float64    65
int64      41
object     16
dtype: int64

Number of unique entries per column, for each data type

In [ ]:
app_train.select_dtypes('object').apply(pd.Series.nunique)
Out[ ]:
NAME_CONTRACT_TYPE             2
CODE_GENDER                    2
FLAG_OWN_CAR                   2
FLAG_OWN_REALTY                2
NAME_TYPE_SUITE                7
NAME_INCOME_TYPE               8
NAME_EDUCATION_TYPE            5
NAME_FAMILY_STATUS             6
NAME_HOUSING_TYPE              6
OCCUPATION_TYPE               18
WEEKDAY_APPR_PROCESS_START     7
ORGANIZATION_TYPE             57
FONDKAPREMONT_MODE             4
HOUSETYPE_MODE                 3
WALLSMATERIAL_MODE             7
EMERGENCYSTATE_MODE            2
dtype: int64

Most categorical variables have a relatively small number of unique entries.

In [ ]:
app_train.select_dtypes('int').apply(pd.Series.nunique)
Out[ ]:
SK_ID_CURR                     307511
TARGET                              2
CNT_CHILDREN                       15
DAYS_BIRTH                      17460
DAYS_EMPLOYED                   12574
DAYS_ID_PUBLISH                  6168
FLAG_MOBIL                          2
FLAG_EMP_PHONE                      2
FLAG_WORK_PHONE                     2
FLAG_CONT_MOBILE                    2
FLAG_PHONE                          2
FLAG_EMAIL                          2
REGION_RATING_CLIENT                3
REGION_RATING_CLIENT_W_CITY         3
HOUR_APPR_PROCESS_START            24
REG_REGION_NOT_LIVE_REGION          2
REG_REGION_NOT_WORK_REGION          2
LIVE_REGION_NOT_WORK_REGION         2
REG_CITY_NOT_LIVE_CITY              2
REG_CITY_NOT_WORK_CITY              2
LIVE_CITY_NOT_WORK_CITY             2
FLAG_DOCUMENT_2                     2
FLAG_DOCUMENT_3                     2
FLAG_DOCUMENT_4                     2
FLAG_DOCUMENT_5                     2
FLAG_DOCUMENT_6                     2
FLAG_DOCUMENT_7                     2
FLAG_DOCUMENT_8                     2
FLAG_DOCUMENT_9                     2
FLAG_DOCUMENT_10                    2
FLAG_DOCUMENT_11                    2
FLAG_DOCUMENT_12                    2
FLAG_DOCUMENT_13                    2
FLAG_DOCUMENT_14                    2
FLAG_DOCUMENT_15                    2
FLAG_DOCUMENT_16                    2
FLAG_DOCUMENT_17                    2
FLAG_DOCUMENT_18                    2
FLAG_DOCUMENT_19                    2
FLAG_DOCUMENT_20                    2
FLAG_DOCUMENT_21                    2
dtype: int64
In [ ]:
app_train_int = list(app_train.select_dtypes('int'))
app_test_int = list(app_test.select_dtypes('int'))
In [ ]:
app_train.select_dtypes('float').apply(pd.Series.nunique)
Out[ ]:
AMT_INCOME_TOTAL               2548
AMT_CREDIT                     5603
AMT_ANNUITY                   13672
AMT_GOODS_PRICE                1002
REGION_POPULATION_RELATIVE       81
                              ...  
AMT_REQ_CREDIT_BUREAU_DAY         9
AMT_REQ_CREDIT_BUREAU_WEEK        9
AMT_REQ_CREDIT_BUREAU_MON        24
AMT_REQ_CREDIT_BUREAU_QRT        11
AMT_REQ_CREDIT_BUREAU_YEAR       25
Length: 65, dtype: int64

Missing data

In [ ]:
def pie_NaN(dataframe, size):
    """Draw a pie plot showing the proportion of missing data
    in the whole dataset.
    dataframe : dataset
    size : size of the figure (X,X)"""
    lines = dataframe.shape[0]
    columns = dataframe.shape[1]
    # Number of non-missing cells
    nb_data = dataframe.count().sum()
    # Total number of cells = columns * lines
    nb_totale = (columns*lines)
    # Filling rate
    rate_dataOK = (nb_data/nb_totale)
    print("The data set is filled in at {:.2%}".format(rate_dataOK))
    print("and it has {:.2%} of missing data".format(1-rate_dataOK))
    print("\n \n ")
    # Pie Plot
    rates = [rate_dataOK, 1 - rate_dataOK]
    labels = ["Data", "NaN"]
    explode = (0, 0.1)
    colors = ['gold', 'pink']
    # Plot
    plt.figure(figsize=size)
    plt.pie(rates, explode=explode, labels=labels, colors=colors,
            autopct='%.2f%%', shadow=True, textprops={'fontsize': 26})
    ttl = plt.title("Fill rate of the dataset", fontsize=32)
    ttl.set_position([0.5, 0.85])
    plt.axis('equal')
    # ax.legend(labels, loc = "upper right", fontsize = 18)
    plt.tight_layout()
    plt.show()
In [ ]:
pie_NaN(app_train, (10,10))
The data set is filled in at 75.46%
and it has 24.54% of missing data

 
 

More than 24% of the training data is missing. Let's see exactly which columns are affected.
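The fill rate reported above is simply the number of non-null cells divided by the total number of cells. A minimal sketch on a hypothetical toy frame (the values are invented for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: 3 rows x 2 columns = 6 cells, 2 of them missing.
toy = pd.DataFrame({
    "a": [1.0, np.nan, 3.0],
    "b": [np.nan, 5.0, 6.0],
})

filled_cells = toy.count().sum()  # non-null cells
total_cells = toy.size            # rows * columns
fill_rate = filled_cells / total_cells

print(f"Filled: {fill_rate:.2%}, missing: {1 - fill_rate:.2%}")
```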

In [ ]:
msno.matrix(app_train)
Out[ ]:
<AxesSubplot:>
In [ ]:
#Global view of the missing values (black)
plt.figure(figsize=(20,10))
sns.heatmap(app_train.notna(), cbar=False)
plt.show()

Missing data is concentrated in the housing characteristics rather than in the credit-related columns.

In [ ]:
# Function to calculate the % of missing data on the entered dataset.
def missing_values_table(df):
    # Total missing values
    mis_val = df.isnull().sum()

    # Percentage of missing values
    mis_val_percent = 100 * mis_val / len(df)

    # Make a table with the results
    mis_val_table = pd.concat([mis_val, mis_val_percent], axis=1)

    # Rename the columns
    mis_val_table_ren_columns = mis_val_table.rename(
        columns={0: 'Missing Values', 1: '% of Total Values'})

    # Keep only columns with missing values, sorted by descending percentage
    mis_val_table_ren_columns = mis_val_table_ren_columns[
        mis_val_table_ren_columns.iloc[:, 1] != 0].sort_values(
        '% of Total Values', ascending=False).round(1)

    # Print some summary information
    print("Your selected dataframe has " + str(df.shape[1]) + " columns.\n"
          "There are " + str(mis_val_table_ren_columns.shape[0]) +
          " columns that have missing values.")

    # Return the dataframe with missing information
    return mis_val_table_ren_columns
In [ ]:
missing_values = missing_values_table(app_train)
missing_values.head(20)
Your selected dataframe has 122 columns.
There are 69 columns that have missing values.
Out[ ]:
                          Missing Values  % of Total Values
COMMONAREA_AVG                    214865               69.9
COMMONAREA_MEDI                   214865               69.9
COMMONAREA_MODE                   214865               69.9
NONLIVINGAPARTMENTS_MEDI          213514               69.4
NONLIVINGAPARTMENTS_AVG           213514               69.4
NONLIVINGAPARTMENTS_MODE          213514               69.4
FONDKAPREMONT_MODE                210295               68.4
LIVINGAPARTMENTS_AVG              210199               68.4
LIVINGAPARTMENTS_MEDI             210199               68.4
LIVINGAPARTMENTS_MODE             210199               68.4
FLOORSMIN_MEDI                    208642               67.8
FLOORSMIN_AVG                     208642               67.8
FLOORSMIN_MODE                    208642               67.8
YEARS_BUILD_MODE                  204488               66.5
YEARS_BUILD_MEDI                  204488               66.5
YEARS_BUILD_AVG                   204488               66.5
OWN_CAR_AGE                       202929               66.0
LANDAREA_MEDI                     182590               59.4
LANDAREA_AVG                      182590               59.4
LANDAREA_MODE                     182590               59.4
In [ ]:
# NaN on categorical variables.
app_train.select_dtypes('object').isna().sum(axis=0)
Out[ ]:
NAME_CONTRACT_TYPE                 0
CODE_GENDER                        4
FLAG_OWN_CAR                       0
FLAG_OWN_REALTY                    0
NAME_TYPE_SUITE                 1292
NAME_INCOME_TYPE                   0
NAME_EDUCATION_TYPE                0
NAME_FAMILY_STATUS                 0
NAME_HOUSING_TYPE                  0
OCCUPATION_TYPE                96391
WEEKDAY_APPR_PROCESS_START         0
ORGANIZATION_TYPE              55374
FONDKAPREMONT_MODE            210295
HOUSETYPE_MODE                154297
WALLSMATERIAL_MODE            156341
EMERGENCYSTATE_MODE           145755
dtype: int64
In [ ]:
# NaN on ints.
app_train.select_dtypes('int').isna().sum()
Out[ ]:
SK_ID_CURR                     0
TARGET                         0
CNT_CHILDREN                   0
DAYS_BIRTH                     0
DAYS_EMPLOYED                  0
DAYS_ID_PUBLISH                0
FLAG_MOBIL                     0
FLAG_EMP_PHONE                 0
FLAG_WORK_PHONE                0
FLAG_CONT_MOBILE               0
FLAG_PHONE                     0
FLAG_EMAIL                     0
REGION_RATING_CLIENT           0
REGION_RATING_CLIENT_W_CITY    0
HOUR_APPR_PROCESS_START        0
REG_REGION_NOT_LIVE_REGION     0
REG_REGION_NOT_WORK_REGION     0
LIVE_REGION_NOT_WORK_REGION    0
REG_CITY_NOT_LIVE_CITY         0
REG_CITY_NOT_WORK_CITY         0
LIVE_CITY_NOT_WORK_CITY        0
FLAG_DOCUMENT_2                0
FLAG_DOCUMENT_3                0
FLAG_DOCUMENT_4                0
FLAG_DOCUMENT_5                0
FLAG_DOCUMENT_6                0
FLAG_DOCUMENT_7                0
FLAG_DOCUMENT_8                0
FLAG_DOCUMENT_9                0
FLAG_DOCUMENT_10               0
FLAG_DOCUMENT_11               0
FLAG_DOCUMENT_12               0
FLAG_DOCUMENT_13               0
FLAG_DOCUMENT_14               0
FLAG_DOCUMENT_15               0
FLAG_DOCUMENT_16               0
FLAG_DOCUMENT_17               0
FLAG_DOCUMENT_18               0
FLAG_DOCUMENT_19               0
FLAG_DOCUMENT_20               0
FLAG_DOCUMENT_21               0
dtype: int64

There is no missing data in the integer columns.

In [ ]:
# NaN on the floats.
app_train.select_dtypes('float').isna().sum()
Out[ ]:
AMT_INCOME_TOTAL                  0
AMT_CREDIT                        0
AMT_ANNUITY                      12
AMT_GOODS_PRICE                 278
REGION_POPULATION_RELATIVE        0
                              ...  
AMT_REQ_CREDIT_BUREAU_DAY     41519
AMT_REQ_CREDIT_BUREAU_WEEK    41519
AMT_REQ_CREDIT_BUREAU_MON     41519
AMT_REQ_CREDIT_BUREAU_QRT     41519
AMT_REQ_CREDIT_BUREAU_YEAR    41519
Length: 65, dtype: int64

Duplicates

In [ ]:
list_names = ['app_train', 'app_test']
datasets = [app_train, app_test]
# Count rows whose SK_ID_CURR already appeared earlier in the same file
for name, dataset in zip(list_names, datasets):
    print("Duplicate of the dataset {}." .format(name))
    print(dataset.duplicated('SK_ID_CURR').sum())
    print("\n")
Duplicate of the dataset app_train.
0


Duplicate of the dataset app_test.
0


There are no duplicates in the app_train and app_test datasets.
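For reference, `duplicated` with a column subset marks every repeat of a key after its first occurrence. A small sketch on hypothetical data (the rows below are invented for illustration):

```python
import pandas as pd

# Hypothetical toy data with one repeated identifier.
toy = pd.DataFrame({
    "SK_ID_CURR": [100002, 100003, 100003, 100004],
    "AMT_CREDIT": [1000.0, 2000.0, 2000.0, 3000.0],
})

# True for every row whose SK_ID_CURR already appeared earlier.
is_repeat = toy.duplicated(subset="SK_ID_CURR")
n_duplicates = int(is_repeat.sum())
print(n_duplicates)

# Keep only the first occurrence of each identifier.
deduplicated = toy.drop_duplicates(subset="SK_ID_CURR")
print(len(deduplicated))
```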

Outliers and anomalous values

Such values may come from mistyped numbers, faulty measuring equipment, or valid but extreme measurements.

DAYS_BIRTH

The values in the DAYS_BIRTH column are negative because they are recorded relative to the date of the current loan application. To view these statistics in years, we can multiply them by -1 and divide by the number of days in a year.

In [ ]:
app_train['DAYS_BIRTH'].describe()
Out[ ]:
count    307511.000000
mean     -16036.995067
std        4363.988632
min      -25229.000000
25%      -19682.000000
50%      -15750.000000
75%      -12413.000000
max       -7489.000000
Name: DAYS_BIRTH, dtype: float64

Because these values are negative offsets from the application date, we convert them to years to make them easier to read.

In [ ]:
(app_train['DAYS_BIRTH'] / -365).describe()
Out[ ]:
count    307511.000000
mean         43.936973
std          11.956133
min          20.517808
25%          34.008219
50%          43.150685
75%          53.923288
max          69.120548
Name: DAYS_BIRTH, dtype: float64

On average, clients are about 44 years old; the youngest is 20 and the oldest 69. Half of the clients are under 43. The dataset is therefore centered on people in their forties.

In [ ]:
app_train['AGE'] = round(app_train['DAYS_BIRTH'] / -365).astype('int')
app_test['AGE'] = round(app_test['DAYS_BIRTH'] / -365).astype('int')
In [ ]:
app_train
Out[ ]:
SK_ID_CURR TARGET NAME_CONTRACT_TYPE CODE_GENDER FLAG_OWN_CAR FLAG_OWN_REALTY CNT_CHILDREN AMT_INCOME_TOTAL AMT_CREDIT AMT_ANNUITY AMT_GOODS_PRICE NAME_TYPE_SUITE NAME_INCOME_TYPE NAME_EDUCATION_TYPE NAME_FAMILY_STATUS NAME_HOUSING_TYPE REGION_POPULATION_RELATIVE DAYS_BIRTH DAYS_EMPLOYED DAYS_REGISTRATION DAYS_ID_PUBLISH OWN_CAR_AGE FLAG_MOBIL FLAG_EMP_PHONE FLAG_WORK_PHONE FLAG_CONT_MOBILE FLAG_PHONE FLAG_EMAIL OCCUPATION_TYPE CNT_FAM_MEMBERS REGION_RATING_CLIENT REGION_RATING_CLIENT_W_CITY WEEKDAY_APPR_PROCESS_START HOUR_APPR_PROCESS_START REG_REGION_NOT_LIVE_REGION REG_REGION_NOT_WORK_REGION LIVE_REGION_NOT_WORK_REGION REG_CITY_NOT_LIVE_CITY REG_CITY_NOT_WORK_CITY LIVE_CITY_NOT_WORK_CITY ORGANIZATION_TYPE EXT_SOURCE_1 EXT_SOURCE_2 EXT_SOURCE_3 APARTMENTS_AVG BASEMENTAREA_AVG YEARS_BEGINEXPLUATATION_AVG YEARS_BUILD_AVG COMMONAREA_AVG ELEVATORS_AVG ENTRANCES_AVG FLOORSMAX_AVG FLOORSMIN_AVG LANDAREA_AVG LIVINGAPARTMENTS_AVG LIVINGAREA_AVG NONLIVINGAPARTMENTS_AVG NONLIVINGAREA_AVG APARTMENTS_MODE BASEMENTAREA_MODE YEARS_BEGINEXPLUATATION_MODE YEARS_BUILD_MODE COMMONAREA_MODE ELEVATORS_MODE ENTRANCES_MODE FLOORSMAX_MODE FLOORSMIN_MODE LANDAREA_MODE LIVINGAPARTMENTS_MODE LIVINGAREA_MODE NONLIVINGAPARTMENTS_MODE NONLIVINGAREA_MODE APARTMENTS_MEDI BASEMENTAREA_MEDI YEARS_BEGINEXPLUATATION_MEDI YEARS_BUILD_MEDI COMMONAREA_MEDI ELEVATORS_MEDI ENTRANCES_MEDI FLOORSMAX_MEDI FLOORSMIN_MEDI LANDAREA_MEDI LIVINGAPARTMENTS_MEDI LIVINGAREA_MEDI NONLIVINGAPARTMENTS_MEDI NONLIVINGAREA_MEDI FONDKAPREMONT_MODE HOUSETYPE_MODE TOTALAREA_MODE WALLSMATERIAL_MODE EMERGENCYSTATE_MODE OBS_30_CNT_SOCIAL_CIRCLE DEF_30_CNT_SOCIAL_CIRCLE OBS_60_CNT_SOCIAL_CIRCLE DEF_60_CNT_SOCIAL_CIRCLE DAYS_LAST_PHONE_CHANGE FLAG_DOCUMENT_2 FLAG_DOCUMENT_3 FLAG_DOCUMENT_4 FLAG_DOCUMENT_5 FLAG_DOCUMENT_6 FLAG_DOCUMENT_7 FLAG_DOCUMENT_8 FLAG_DOCUMENT_9 FLAG_DOCUMENT_10 FLAG_DOCUMENT_11 FLAG_DOCUMENT_12 FLAG_DOCUMENT_13 FLAG_DOCUMENT_14 FLAG_DOCUMENT_15 FLAG_DOCUMENT_16 FLAG_DOCUMENT_17 FLAG_DOCUMENT_18 
FLAG_DOCUMENT_19 FLAG_DOCUMENT_20 FLAG_DOCUMENT_21 AMT_REQ_CREDIT_BUREAU_HOUR AMT_REQ_CREDIT_BUREAU_DAY AMT_REQ_CREDIT_BUREAU_WEEK AMT_REQ_CREDIT_BUREAU_MON AMT_REQ_CREDIT_BUREAU_QRT AMT_REQ_CREDIT_BUREAU_YEAR AGE
0 100002 1 Cash loans M N Y 0 202500.0 406597.5 24700.5 351000.0 Unaccompanied Working Secondary / secondary special Single / not married House / apartment 0.018801 -9461 -637 -3648.0 -2120 NaN 1 1 0 1 1 0 Laborers 1.0 2 2 WEDNESDAY 10 0 0 0 0 0 0 Business Entity Type 3 0.083037 0.262949 0.139376 0.0247 0.0369 0.9722 0.6192 0.0143 0.00 0.0690 0.0833 0.1250 0.0369 0.0202 0.0190 0.0000 0.0000 0.0252 0.0383 0.9722 0.6341 0.0144 0.0000 0.0690 0.0833 0.1250 0.0377 0.0220 0.0198 0.0 0.0000 0.0250 0.0369 0.9722 0.6243 0.0144 0.00 0.0690 0.0833 0.1250 0.0375 0.0205 0.0193 0.0000 0.0000 reg oper account block of flats 0.0149 Stone, brick No 2.0 2.0 2.0 2.0 -1134.0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0.0 0.0 0.0 0.0 0.0 1.0 26
1 100003 0 Cash loans F N N 0 270000.0 1293502.5 35698.5 1129500.0 Family State servant Higher education Married House / apartment 0.003541 -16765 -1188 -1186.0 -291 NaN 1 1 0 1 1 0 Core staff 2.0 1 1 MONDAY 11 0 0 0 0 0 0 School 0.311267 0.622246 NaN 0.0959 0.0529 0.9851 0.7960 0.0605 0.08 0.0345 0.2917 0.3333 0.0130 0.0773 0.0549 0.0039 0.0098 0.0924 0.0538 0.9851 0.8040 0.0497 0.0806 0.0345 0.2917 0.3333 0.0128 0.0790 0.0554 0.0 0.0000 0.0968 0.0529 0.9851 0.7987 0.0608 0.08 0.0345 0.2917 0.3333 0.0132 0.0787 0.0558 0.0039 0.0100 reg oper account block of flats 0.0714 Block No 1.0 0.0 1.0 0.0 -828.0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0.0 0.0 0.0 0.0 0.0 0.0 46
2 100004 0 Revolving loans M Y Y 0 67500.0 135000.0 6750.0 135000.0 Unaccompanied Working Secondary / secondary special Single / not married House / apartment 0.010032 -19046 -225 -4260.0 -2531 26.0 1 1 1 1 1 0 Laborers 1.0 2 2 MONDAY 9 0 0 0 0 0 0 Government NaN 0.555912 0.729567 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 0.0 0.0 0.0 0.0 -815.0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0.0 0.0 0.0 0.0 0.0 0.0 52
3 100006 0 Cash loans F N Y 0 135000.0 312682.5 29686.5 297000.0 Unaccompanied Working Secondary / secondary special Civil marriage House / apartment 0.008019 -19005 -3039 -9833.0 -2437 NaN 1 1 0 1 0 0 Laborers 2.0 2 2 WEDNESDAY 17 0 0 0 0 0 0 Business Entity Type 3 NaN 0.650442 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 2.0 0.0 2.0 0.0 -617.0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 NaN NaN NaN NaN NaN NaN 52
4 100007 0 Cash loans M N Y 0 121500.0 513000.0 21865.5 513000.0 Unaccompanied Working Secondary / secondary special Single / not married House / apartment 0.028663 -19932 -3038 -4311.0 -3458 NaN 1 1 0 1 0 0 Core staff 1.0 2 2 THURSDAY 11 0 0 0 0 1 1 Religion NaN 0.322738 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 0.0 0.0 0.0 0.0 -1106.0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0.0 0.0 0.0 0.0 0.0 0.0 55
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
307506 456251 0 Cash loans M N N 0 157500.0 254700.0 27558.0 225000.0 Unaccompanied Working Secondary / secondary special Separated With parents 0.032561 -9327 -236 -8456.0 -1982 NaN 1 1 0 1 0 0 Sales staff 1.0 1 1 THURSDAY 15 0 0 0 0 0 0 Services 0.145570 0.681632 NaN 0.2021 0.0887 0.9876 0.8300 0.0202 0.22 0.1034 0.6042 0.2708 0.0594 0.1484 0.1965 0.0753 0.1095 0.1008 0.0172 0.9782 0.7125 0.0172 0.0806 0.0345 0.4583 0.0417 0.0094 0.0882 0.0853 0.0 0.0125 0.2040 0.0887 0.9876 0.8323 0.0203 0.22 0.1034 0.6042 0.2708 0.0605 0.1509 0.2001 0.0757 0.1118 reg oper account block of flats 0.2898 Stone, brick No 0.0 0.0 0.0 0.0 -273.0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 NaN NaN NaN NaN NaN NaN 26
307507 456252 0 Cash loans F N Y 0 72000.0 269550.0 12001.5 225000.0 Unaccompanied Pensioner Secondary / secondary special Widow House / apartment 0.025164 -20775 365243 -4388.0 -4090 NaN 1 0 0 1 1 0 NaN 1.0 2 2 MONDAY 8 0 0 0 0 0 0 NaN NaN 0.115992 NaN 0.0247 0.0435 0.9727 0.6260 0.0022 0.00 0.1034 0.0833 0.1250 0.0579 0.0202 0.0257 0.0000 0.0000 0.0252 0.0451 0.9727 0.6406 0.0022 0.0000 0.1034 0.0833 0.1250 0.0592 0.0220 0.0267 0.0 0.0000 0.0250 0.0435 0.9727 0.6310 0.0022 0.00 0.1034 0.0833 0.1250 0.0589 0.0205 0.0261 0.0000 0.0000 reg oper account block of flats 0.0214 Stone, brick No 0.0 0.0 0.0 0.0 0.0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 NaN NaN NaN NaN NaN NaN 57
307508 456253 0 Cash loans F N Y 0 153000.0 677664.0 29979.0 585000.0 Unaccompanied Working Higher education Separated House / apartment 0.005002 -14966 -7921 -6737.0 -5150 NaN 1 1 0 1 0 1 Managers 1.0 3 3 THURSDAY 9 0 0 0 0 1 1 School 0.744026 0.535722 0.218859 0.1031 0.0862 0.9816 0.7484 0.0123 0.00 0.2069 0.1667 0.2083 NaN 0.0841 0.9279 0.0000 0.0000 0.1050 0.0894 0.9816 0.7583 0.0124 0.0000 0.2069 0.1667 0.2083 NaN 0.0918 0.9667 0.0 0.0000 0.1041 0.0862 0.9816 0.7518 0.0124 0.00 0.2069 0.1667 0.2083 NaN 0.0855 0.9445 0.0000 0.0000 reg oper account block of flats 0.7970 Panel No 6.0 0.0 6.0 0.0 -1909.0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1.0 0.0 0.0 1.0 0.0 1.0 41
307509 456254 1 Cash loans F N Y 0 171000.0 370107.0 20205.0 319500.0 Unaccompanied Commercial associate Secondary / secondary special Married House / apartment 0.005313 -11961 -4786 -2562.0 -931 NaN 1 1 0 1 0 0 Laborers 2.0 2 2 WEDNESDAY 9 0 0 0 1 1 0 Business Entity Type 1 NaN 0.514163 0.661024 0.0124 NaN 0.9771 NaN NaN NaN 0.0690 0.0417 NaN NaN NaN 0.0061 NaN NaN 0.0126 NaN 0.9772 NaN NaN NaN 0.0690 0.0417 NaN NaN NaN 0.0063 NaN NaN 0.0125 NaN 0.9771 NaN NaN NaN 0.0690 0.0417 NaN NaN NaN 0.0062 NaN NaN NaN block of flats 0.0086 Stone, brick No 0.0 0.0 0.0 0.0 -322.0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0.0 0.0 0.0 0.0 0.0 0.0 33
307510 456255 0 Cash loans F N N 0 157500.0 675000.0 49117.5 675000.0 Unaccompanied Commercial associate Higher education Married House / apartment 0.046220 -16856 -1262 -5128.0 -410 NaN 1 1 1 1 1 0 Laborers 2.0 1 1 THURSDAY 20 0 0 0 0 1 1 Business Entity Type 3 0.734460 0.708569 0.113922 0.0742 0.0526 0.9881 NaN 0.0176 0.08 0.0690 0.3750 NaN NaN NaN 0.0791 NaN 0.0000 0.0756 0.0546 0.9881 NaN 0.0178 0.0806 0.0690 0.3750 NaN NaN NaN 0.0824 NaN 0.0000 0.0749 0.0526 0.9881 NaN 0.0177 0.08 0.0690 0.3750 NaN NaN NaN 0.0805 NaN 0.0000 NaN block of flats 0.0718 Panel No 0.0 0.0 0.0 0.0 -787.0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0.0 0.0 0.0 2.0 0.0 1.0 46

307511 rows × 123 columns

In [ ]:
# Age distribution (in years)
plt.figure(figsize=(15,10))
sns.histplot(app_train['DAYS_BIRTH'] / -365, stat='count', color='red', kde=True)
plt.title('Age of Client')
plt.xlabel('Age (years)')
plt.ylabel('Count')
Out[ ]:
Text(0, 0.5, 'Count')

There are no outliers: all ages are plausible.

DAYS_EMPLOYED

How many days before the application the person started their current employment (time relative to the application only).

In [ ]:
app_train['DAYS_EMPLOYED'].describe()
Out[ ]:
count    307511.000000
mean      63815.045904
std      141275.766519
min      -17912.000000
25%       -2760.000000
50%       -1213.000000
75%        -289.000000
max      365243.000000
Name: DAYS_EMPLOYED, dtype: float64
In [ ]:
app_test['DAYS_EMPLOYED'].describe()
Out[ ]:
count     48744.000000
mean      67485.366322
std      144348.507136
min      -17463.000000
25%       -2910.000000
50%       -1293.000000
75%        -296.000000
max      365243.000000
Name: DAYS_EMPLOYED, dtype: float64
In [ ]:
plt.figure(figsize = (25, 15))
plt.subplot(2, 2, 1)
plt.title('Number of working day-train', weight='bold', size=18)
# histplot replaces the deprecated distplot
sns.histplot(app_train['DAYS_EMPLOYED'], kde=False, bins=30)

plt.subplot(2, 2, 2)
plt.title('Number of working day-test', weight='bold', size=18)
sns.histplot(app_test['DAYS_EMPLOYED'], kde=False, bins=30)
Out[ ]:
<AxesSubplot:title={'center':'Number of working day-test'}, xlabel='DAYS_EMPLOYED'>

This analysis reveals anomalous data: the maximum, 365,243 days, corresponds to roughly 1,000 years of employment, which is impossible.
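The order of magnitude is easy to check with a one-line conversion, using the same 365-day year as for DAYS_BIRTH:

```python
# Maximum DAYS_EMPLOYED reported by describe() for both files.
max_days_employed = 365243

# Same 365-day approximation as the DAYS_BIRTH conversion.
years_employed = max_days_employed / 365
print(f"{years_employed:.1f} years")
```

The value is also positive, whereas every plausible entry in this column is a negative offset from the application date.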

In [ ]:
anom = app_train[app_train['DAYS_EMPLOYED'] >= 350000]
non_anom = app_train[app_train['DAYS_EMPLOYED'] < 350000]
print('The non-anomalies default on %0.2f%% of loans' % (100 * non_anom['TARGET'].mean()))
print('The anomalies default on %0.2f%% of loans' % (100 * anom['TARGET'].mean()))
print('There are %d anomalous days of employment' % len(anom))
The non-anomalies default on 8.66% of loans
The anomalies default on 5.40% of loans
There are 55374 anomalous days of employment

Non-anomalous records default on 8.66% of loans, whereas anomalous records default on only 5.40%, a lower rate. We will replace the anomalous values with not-a-number (np.nan) and later fill them with the average number of days worked.
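This treatment can be sketched on made-up values. The `toy` frame is hypothetical, and mean imputation is only one possible way to fill the gaps: flag the anomalous value first, then turn it into NaN, then fill with the mean of the remaining values.

```python
import numpy as np
import pandas as pd

SENTINEL = 365243  # anomalous DAYS_EMPLOYED value found in the data

# Hypothetical toy values; only the sentinel comes from the dataset.
toy = pd.DataFrame({"DAYS_EMPLOYED": [-637, -1188, SENTINEL, -225]})

# 1. Keep track of which rows were anomalous.
toy["DAYS_EMPLOYED_ANOM"] = toy["DAYS_EMPLOYED"] == SENTINEL

# 2. Turn the sentinel into a missing value.
toy["DAYS_EMPLOYED"] = toy["DAYS_EMPLOYED"].replace(SENTINEL, np.nan)

# 3. Fill the gaps with the mean of the valid values.
mean_days = toy["DAYS_EMPLOYED"].mean()
toy["DAYS_EMPLOYED"] = toy["DAYS_EMPLOYED"].fillna(mean_days)
print(toy)
```

Keeping the flag column preserves the information that a value was anomalous even after the gap has been filled.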

In [ ]:
anom['DAYS_EMPLOYED'].unique()
Out[ ]:
array([365243])
In [ ]:
# Create an anomalous flag column
app_train['DAYS_EMPLOYED_ANOM'] = app_train["DAYS_EMPLOYED"] == 365243
app_train['DAYS_EMPLOYED_ANOM'] = app_train['DAYS_EMPLOYED_ANOM'].astype('object')


app_test['DAYS_EMPLOYED_ANOM'] = app_test["DAYS_EMPLOYED"] == 365243
app_test['DAYS_EMPLOYED_ANOM'] = app_test['DAYS_EMPLOYED_ANOM'].astype('object')
In [ ]:
app_train.dtypes
Out[ ]:
SK_ID_CURR                      int64
TARGET                          int64
NAME_CONTRACT_TYPE             object
CODE_GENDER                    object
FLAG_OWN_CAR                   object
                               ...   
AMT_REQ_CREDIT_BUREAU_MON     float64
AMT_REQ_CREDIT_BUREAU_QRT     float64
AMT_REQ_CREDIT_BUREAU_YEAR    float64
AGE                             int64
DAYS_EMPLOYED_ANOM             object
Length: 124, dtype: object
In [ ]:
# Replace the anomalous values with nan (assign back rather than use inplace on a column)
app_train['DAYS_EMPLOYED'] = app_train['DAYS_EMPLOYED'].replace({365243: np.nan})
app_test['DAYS_EMPLOYED'] = app_test['DAYS_EMPLOYED'].replace({365243: np.nan})
In [ ]:
plt.figure(figsize=(15,10))
# Density histogram, KDE curve and rug for each set, one color per set
for data, label, color in [(app_train, 'Train', 'C0'), (app_test, 'Test', 'C1')]:
    sns.histplot(data['DAYS_EMPLOYED'], bins=25, stat='density', kde=True,
                 color=color, label=label)
    sns.rugplot(x=data['DAYS_EMPLOYED'], color=color)
plt.title('Histogram of DAYS_EMPLOYED after replacing anomalies with NaN for the train and test sets',
          weight='bold', size=18)
plt.xlabel('Days Employment', weight="bold")
plt.legend()
plt.show()

print('There are %d anomalies out of %d entries in the train set\n' % (app_train["DAYS_EMPLOYED"].isna().sum(), len(app_train)))
print('There are %d anomalies out of %d entries in the test set\n' % (app_test["DAYS_EMPLOYED"].isna().sum(), len(app_test)))
There are 55374 anomalies out of 307511 entries in the train set

There are 9274 anomalies out of 48744 entries in the test set

Analysis: Train

In [ ]:
def test_train_col_category(dataframe_train, dataframe_test, col, size):
    """Plot a categorical variable as two pie plots, one for the
    training set and one for the testing set.
    dataframe_train : training dataset
    dataframe_test : testing dataset
    col : name of the column to plot
    size : size of the figure (X,X)"""
    values_train = dataframe_train[col].value_counts()
    labels_train = values_train.index
    values_test = dataframe_test[col].value_counts()
    labels_test = values_test.index

    plt.figure(figsize=size)

    # Pie plot for the training set
    plt.subplot(2, 2, 1)
    plt.title("Representation of the variable {} for training set" .format(
        col), fontsize=20)
    plt.pie(values_train, labels=labels_train,
            autopct='%.1f%%', shadow=True, textprops={'fontsize': 20})
    plt.axis('equal')

    # Pie plot for the testing set
    plt.subplot(2, 2, 2)
    plt.title("Representation of the variable {} for testing set" .format(
        col), fontsize=20)
    plt.pie(values_test, labels=labels_test,
            autopct='%.1f%%', shadow=True, textprops={'fontsize': 20})
    plt.axis('equal')
    plt.tight_layout()
    plt.legend()
    plt.show()
In [ ]:
def plot_stat(data, feature, title, size):
    """Horizontal count plot of a feature, annotated with percentages."""
    # plt.subplots returns (figure, axes), in that order
    fig, ax = plt.subplots(figsize=size)
    sns.countplot(y=feature, data=data, ax=ax,
                  order=data[feature].value_counts(ascending=False).index)
    ax.set_title(title)

    for p in ax.patches:
        percentage = '{:.1f}%'.format(100 * p.get_width()/len(data[feature]))
        x = p.get_x() + p.get_width()
        y = p.get_y() + p.get_height()/2
        ax.annotate(percentage, (x, y), fontsize=20, fontweight='bold')

    plt.show()
In [ ]:
def plot_percent_target1(data, feature, title, size):
    """Horizontal bar plot of the share of TARGET = 1 for each value of a feature."""
    # TARGET is 0/1, so its mean per group is the default rate of that group
    cat_perc = data[[feature, 'TARGET']].groupby([feature], as_index=False).mean()
    cat_perc.sort_values(by='TARGET', ascending=False, inplace=True)

    fig, ax = plt.subplots(figsize=size)
    sns.barplot(y=feature, x='TARGET', data=cat_perc, ax=ax)
    ax.set_title(title)
    # Bars are horizontal, so the rate is read on the x axis
    ax.set_xlabel("Share of target with value 1")
    ax.set_ylabel("")

    for p in ax.patches:
        percentage = '{:.1f}%'.format(100 * p.get_width())
        x = p.get_x() + p.get_width()
        y = p.get_y() + p.get_height()/2
        ax.annotate(percentage, (x, y), fontsize=20, fontweight='bold')

    plt.show()

Loan types - Distribution of contract types

In [ ]:
test_train_col_category(app_train, app_test, 'NAME_CONTRACT_TYPE', (20,20))

Revolving loans account for only about 10% of all loans.

In [ ]:
plot_percent_target1(app_train, 'NAME_CONTRACT_TYPE', "Default rate by contract type", (15,10))

Relative to how often each contract type occurs, cash (non-revolving) loans are defaulted on more often than revolving loans.
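The share of all defaults that falls in a category and the default rate within that category are easy to confuse. A minimal sketch on hypothetical data (all values are invented for illustration): because TARGET is 0/1, its mean per group is the rate that `plot_percent_target1` displays.

```python
import pandas as pd

# Hypothetical toy data: 6 cash loans (3 defaults), 4 revolving loans (1 default).
toy = pd.DataFrame({
    "NAME_CONTRACT_TYPE": ["Cash loans"] * 6 + ["Revolving loans"] * 4,
    "TARGET": [1, 1, 1, 0, 0, 0, 1, 0, 0, 0],
})

# Default rate within each category: mean of a 0/1 column.
rate_by_type = toy.groupby("NAME_CONTRACT_TYPE")["TARGET"].mean()

# Share of all defaults that falls in each category.
defaults = toy[toy["TARGET"] == 1]
share_of_defaults = defaults["NAME_CONTRACT_TYPE"].value_counts(normalize=True)

print(rate_by_type)
print(share_of_defaults)
```

The first quantity is insensitive to how common a category is, which is why it is the one used to compare categories of very different sizes.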

Client gender - Distribution of male and female clients by loan repayment

Earlier, we saw that female clients are about twice as numerous as male clients in the dataset.

In [ ]:
plot_percent_target1(app_train, 'CODE_GENDER', "Default rate by gender", (15,10))

Men tend to default on their loans more often than women.

Flag own car - Distribution of car ownership

In [ ]:
test_train_col_category(app_train, app_test, 'FLAG_OWN_CAR', (20,20))
In [ ]:
plot_percent_target1(app_train, 'FLAG_OWN_CAR',"Car owner depend on Target1", (15,10))

The default rate is about 8% whether or not the client owns a car.

Cnt Children - Distribution of the number of children

In [ ]:
test_train_col_category(app_train, app_test, 'CNT_CHILDREN', (20,20))
In [ ]:
plot_stat(app_train, 'CNT_CHILDREN', 'Children count for CSV Train', (15,10)) 
In [ ]:
plot_stat(app_test, 'CNT_CHILDREN', 'Children count for CSV Test', (15,10)) 

Looking at the number of children, most clients have none. More than 20% of clients have 1 child, 8% have 2, and 1% have 3. The proportions are very similar between the training set and the test set.
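
The visual comparison between the two sets can also be expressed numerically. The sketch below uses two small invented frames as stand-ins for app_train and app_test (the names toy_train and toy_test are hypothetical) and puts the proportions side by side.

```python
import pandas as pd

# Toy stand-ins for app_train and app_test (invented values)
toy_train = pd.DataFrame({"CNT_CHILDREN": [0, 0, 0, 1, 1, 2, 0, 0, 1, 0]})
toy_test = pd.DataFrame({"CNT_CHILDREN": [0, 0, 1, 0, 2]})

# Proportion of each value in both sets, aligned on the value itself
comparison = pd.DataFrame({
    "train": toy_train["CNT_CHILDREN"].value_counts(normalize=True),
    "test": toy_test["CNT_CHILDREN"].value_counts(normalize=True),
}).fillna(0).sort_index()

# Largest absolute gap between the two distributions
max_gap = (comparison["train"] - comparison["test"]).abs().max()

print(comparison)
print(max_gap)
```

A small maximum gap supports the claim that the two sets follow similar distributions for this feature.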

Family Status - Distribution of family status

In [ ]:
test_train_col_category(app_train, app_test, 'NAME_FAMILY_STATUS', (20,20))

The large majority of clients are married or living with a partner.

In [ ]:
plot_stat(app_train, 'NAME_FAMILY_STATUS', 'Family status for CSV Train', (15,10)) 
In [ ]:
plot_stat(app_test, 'NAME_FAMILY_STATUS', 'Family status for CSV Test', (15,10)) 
In [ ]:
plot_percent_target1(app_train, 'NAME_FAMILY_STATUS',"Family status depend on Target1", (15,10))

Civil marriage has the highest default rate.

Income type - Distribution of income type

In [ ]:
plot_stat(app_train, 'NAME_INCOME_TYPE', 'Income type for CSV Train', (15,10)) 

The vast majority of clients are employed. Most clients report one of three income types: working, commercial associate, or pensioner.

In [ ]:
plot_stat(app_test, 'NAME_INCOME_TYPE', 'Income type for CSV Test', (15,10)) 
In [ ]:
plot_percent_target1(app_train, 'NAME_INCOME_TYPE',"Income type depend on Target1", (15,10))

Default rates are highest among clients whose income comes from maternity leave or unemployment.

Occupation type - Distribution of clients' occupation type

In [ ]:
plot_stat(app_train, 'OCCUPATION_TYPE', 'Client\'s occupation for CSV Train', (15,10)) 

Most clients are laborers.

In [ ]:
plot_stat(app_test, 'OCCUPATION_TYPE', 'Client\'s occupation for CSV Test', (15,10)) 
In [ ]:
plot_percent_target1(app_train, 'OCCUPATION_TYPE',"Occupation type depend on Target1", (15,10))

Default rates are highest among low-skill laborers.

Education type - Distribution of clients' education type

In [ ]:
plot_stat(app_train, 'NAME_EDUCATION_TYPE', 'Education type for CSV Train', (15,10)) 

Most clients have a secondary or higher education.

In [ ]:
plot_stat(app_test, 'NAME_EDUCATION_TYPE', 'Education type for CSV Test', (15,10)) 
In [ ]:
plot_percent_target1(app_train, 'NAME_EDUCATION_TYPE',"Education type  depend on Target1", (15,10))

Clients with a lower secondary education are less likely to repay their loans than clients with a university education.

Housing type - Distribution of clients' housing type

In [ ]:
plot_stat(app_train, 'NAME_HOUSING_TYPE', 'Type of house for CSV Train', (15,10)) 

Most clients live in a house or an apartment.

In [ ]:
plot_stat(app_test, 'NAME_HOUSING_TYPE', 'Type of house for CSV Test', (15,10)) 
In [ ]:
plot_percent_target1(app_train, 'NAME_HOUSING_TYPE',"Type of house depend on Target1", (15,10))

Clients who pay rent or live with their parents have more difficulty repaying a loan.

Average credit amount:

In [ ]:
target_0 = app_train.loc[app_train['TARGET'] == 0]
target_0['AMT_CREDIT'].mean()
Out[ ]:
602648.2820019386

The average credit amount is about 602.6k (in the dataset's currency units) for clients who repay their loan (TARGET = 0).

In [ ]:
target_1 = app_train.loc[app_train['TARGET'] == 1]
target_1['AMT_CREDIT'].mean()
Out[ ]:
557778.527673716

The average credit amount is about 557.8k (in the dataset's currency units) for clients who do not repay their loan (TARGET = 1).
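
The two averages can also be computed in a single pass with a groupby. The sketch below runs on an invented toy frame rather than on app_train, and adds the median as a suggestion of ours, since credit amounts are often skewed and the mean alone can mislead.

```python
import pandas as pd

# Toy data, invented for illustration only
toy = pd.DataFrame({
    "TARGET": [0, 0, 0, 1, 1],
    "AMT_CREDIT": [600000.0, 500000.0, 700000.0, 450000.0, 650000.0],
})

# One row per TARGET value, with the mean and median credit amount
summary = toy.groupby("TARGET")["AMT_CREDIT"].agg(["mean", "median"])

print(summary)
```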

Correlation with the TARGET

The correlation coefficient is not the best way to measure how "relevant" a feature is, but it gives an idea of the possible relationships within the data. Here are some general interpretations of the absolute value of the correlation coefficient:

0.00-0.19: "very weak"
0.20-0.39: "weak"
0.40-0.59: "moderate"
0.60-0.79: "strong"
0.80-1.00: "very strong"
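
As a reading aid, the scale above can be written as a small helper. The function name correlation_strength is hypothetical and the cut-offs are simply the ones listed above; this is a convenience sketch, not a statistical test.

```python
def correlation_strength(r):
    """Map a correlation coefficient to the qualitative label of the scale."""
    magnitude = abs(r)
    if magnitude > 1:
        raise ValueError("a correlation coefficient must lie in [-1, 1]")
    if magnitude < 0.20:
        return "very weak"
    if magnitude < 0.40:
        return "weak"
    if magnitude < 0.60:
        return "moderate"
    if magnitude < 0.80:
        return "strong"
    return "very strong"


# The sign is ignored: only the absolute value determines the label
print(correlation_strength(-0.18))
print(correlation_strength(0.45))
```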

Let's look at the possible relationships between the variables and the TARGET by computing the Pearson coefficient.

In [ ]:
# Find correlations with the target and sort
correlations = app_train.corr()['TARGET'].sort_values()

# Display correlations
print('Most Positive Correlations:\n', correlations.tail(15))
print('\nMost Negative Correlations:\n', correlations.head(15))
Most Positive Correlations:
 DEF_30_CNT_SOCIAL_CIRCLE       0.032248
LIVE_CITY_NOT_WORK_CITY        0.032518
OWN_CAR_AGE                    0.037612
DAYS_REGISTRATION              0.041975
FLAG_DOCUMENT_3                0.044346
REG_CITY_NOT_LIVE_CITY         0.044395
FLAG_EMP_PHONE                 0.045982
REG_CITY_NOT_WORK_CITY         0.050994
DAYS_ID_PUBLISH                0.051457
DAYS_LAST_PHONE_CHANGE         0.055218
REGION_RATING_CLIENT           0.058899
REGION_RATING_CLIENT_W_CITY    0.060893
DAYS_EMPLOYED                  0.074958
DAYS_BIRTH                     0.078239
TARGET                         1.000000
Name: TARGET, dtype: float64

Most Negative Correlations:
 EXT_SOURCE_3                 -0.178919
EXT_SOURCE_2                 -0.160472
EXT_SOURCE_1                 -0.155317
AGE                          -0.078263
FLOORSMAX_AVG                -0.044003
FLOORSMAX_MEDI               -0.043768
FLOORSMAX_MODE               -0.043226
AMT_GOODS_PRICE              -0.039645
REGION_POPULATION_RELATIVE   -0.037227
ELEVATORS_AVG                -0.034199
ELEVATORS_MEDI               -0.033863
FLOORSMIN_AVG                -0.033614
FLOORSMIN_MEDI               -0.033394
LIVINGAREA_AVG               -0.032997
LIVINGAREA_MEDI              -0.032739
Name: TARGET, dtype: float64

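TARGET has its strongest positive correlation with 'DAYS_BIRTH'. DAYS_BIRTH is recorded as a negative number of days relative to the application date, so the next cell takes its absolute value before recomputing the correlation.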

In [ ]:
# Find the correlation of the positive days since birth and target
app_train['DAYS_BIRTH'] = abs(app_train['DAYS_BIRTH'])
app_train['DAYS_BIRTH'].corr(app_train['TARGET'])
Out[ ]:
-0.07823930830982712

Once DAYS_BIRTH is made positive, its correlation with the target is negative: the older the client, the lower the probability of default, which means older clients tend to repay their loans on time more often.
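
The sign flip can be checked on a toy example. The values below are invented; the only assumption carried over from the dataset is that DAYS_BIRTH is stored as a negative number of days. For a column that is entirely negative, taking the absolute value is the same as multiplying by -1, so the Pearson coefficient keeps its magnitude and changes sign.

```python
import pandas as pd

# Toy data (invented): negative days before the application, younger clients first
toy = pd.DataFrame({
    "DAYS_BIRTH": [-9000, -12000, -15000, -18000, -21000, -24000],
    "TARGET": [1, 1, 0, 1, 0, 0],
})

# Correlation with the raw (negative) values
corr_raw = toy["DAYS_BIRTH"].corr(toy["TARGET"])

# Correlation after taking the absolute value (age in days)
corr_abs = toy["DAYS_BIRTH"].abs().corr(toy["TARGET"])

print(corr_raw, corr_abs)
```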

Effect of age on the TARGET

In [ ]:
plt.figure(figsize = (15, 10))

# KDE plot ages when there is no default
sns.kdeplot(app_train.loc[app_train['TARGET'] == 0, 'DAYS_BIRTH'] / 365, label = 'target = 0')
             
             
# KDE plot ages when there is default
sns.kdeplot(app_train.loc[app_train['TARGET'] == 1, 'DAYS_BIRTH'] / 365, label = 'target = 1')

plt.xlabel('Age (years)'); plt.ylabel('Density')
plt.title('Distribution des âges', weight='bold', size=18)
plt.legend()
Out[ ]:
<matplotlib.legend.Legend at 0x7ff16731ff70>

The TARGET == 1 curve skews toward the younger end of the range, which suggests that younger clients have more difficulty repaying. This variable will probably be useful in a machine learning model because it is related to the target.

Ratio of unrepaid loans in each age bracket (5-year bins)

In [ ]:
# Copy the two columns so the new columns are not written to a view of app_train
age_data = app_train[['TARGET', 'DAYS_BIRTH']].copy()
age_data['YEARS_BIRTH'] = age_data['DAYS_BIRTH'] / 365


# Bin ages into 5-year brackets.
age_data['YEARS_BINNED'] = pd.cut(age_data['YEARS_BIRTH'], bins = np.linspace(20, 70, num = 11))

# Group by the bins just created
age_groups = age_data.groupby('YEARS_BINNED').mean()
age_groups
Out[ ]:
YEARS_BINNED      TARGET    DAYS_BIRTH  YEARS_BIRTH
(20.0, 25.0]    0.123036   8532.795625    23.377522
(25.0, 30.0]    0.111436  10155.219250    27.822518
(30.0, 35.0]    0.102814  11854.848377    32.479037
(35.0, 40.0]    0.089414  13707.908253    37.555913
(40.0, 45.0]    0.078491  15497.661233    42.459346
(45.0, 50.0]    0.074171  17323.900441    47.462741
(50.0, 55.0]    0.066968  19196.494791    52.593136
(55.0, 60.0]    0.055314  20984.262742    57.491131
(60.0, 65.0]    0.052737  22780.547460    62.412459
(65.0, 70.0]    0.037270  24292.614340    66.555108

For each 5-year age bracket, we get the mean of TARGET (that is, the share of clients in the bracket with TARGET = 1, a default) along with the mean age of the group, in days (DAYS_BIRTH) and in years (YEARS_BIRTH).

We first cut the age variable into bins of 5 years each. Then, for each bin, we compute the mean of the target, which gives the ratio of unrepaid loans in each age bracket.
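
The two steps described here, binning and then averaging the target per bin, can be illustrated on a toy frame. The ages and targets below are invented, and only three bins are used to keep the output short.

```python
import numpy as np
import pandas as pd

# Toy data, invented for illustration only
toy = pd.DataFrame({
    "YEARS_BIRTH": [22.0, 24.0, 27.0, 29.0, 33.0, 34.0],
    "TARGET": [1, 1, 1, 0, 0, 0],
})

# Bin edges at 20, 25, 30 and 35: three right-closed 5-year intervals
bins = np.linspace(20, 35, num=4)
toy["YEARS_BINNED"] = pd.cut(toy["YEARS_BIRTH"], bins=bins)

# Mean of the binary TARGET per bin = ratio of unrepaid loans in that bin
default_ratio = toy.groupby("YEARS_BINNED", observed=False)["TARGET"].mean()

print(default_ratio)
```

Note that pd.cut builds right-closed intervals by default, so a client aged exactly 25.0 falls in (20.0, 25.0], and a value outside the outer edges gets a missing bin.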

Average loan repayment failure by age bracket.

In [ ]:
plt.figure(figsize=(15, 10))

# Graph the age bins and the average of the target as a bar plot
sns.barplot(x=age_groups.index.astype(str), y=100 * age_groups['TARGET'])

# Plot labeling
plt.xticks(rotation = 75) 
plt.xlabel('Age Group (years)', weight='bold') 
plt.ylabel('Default of payment (%)', weight='bold')
plt.title("Default by age group",
          weight='bold', size=18)
Out[ ]:
Text(0.5, 1.0, 'Default by age group')

The youngest clients are the most likely not to repay the loan.

External sources, the strongest negative linear correlations …

These 3 variables (EXT_SOURCE) show the strongest negative correlations with the target. According to the documentation, these features represent a "normalized score from an external data source". Their exact meaning is hard to pin down; one hypothesis is a cumulative credit rating built from several data sources.

In [ ]:
ext_data = app_train[['TARGET', 'EXT_SOURCE_1', 'EXT_SOURCE_2', 'EXT_SOURCE_3', 'DAYS_BIRTH']]
ext_data_corrs = ext_data.corr()
ext_data_corrs
Out[ ]:
                 TARGET  EXT_SOURCE_1  EXT_SOURCE_2  EXT_SOURCE_3  DAYS_BIRTH
TARGET         1.000000     -0.155317     -0.160472     -0.178919   -0.078239
EXT_SOURCE_1  -0.155317      1.000000      0.213982      0.186846    0.600610
EXT_SOURCE_2  -0.160472      0.213982      1.000000      0.109167    0.091996
EXT_SOURCE_3  -0.178919      0.186846      0.109167      1.000000    0.205478
DAYS_BIRTH    -0.078239      0.600610      0.091996      0.205478    1.000000
In [ ]:
plt.figure(figsize = (15, 10))

# Heatmap of correlations
sns.heatmap(ext_data_corrs, cmap = plt.cm.RdYlBu_r, vmin = -0.25, annot = True, vmax = 0.6)
plt.title('Correlation Heatmap')
Out[ ]:
Text(0.5, 1.0, 'Correlation Heatmap')

All three EXT_SOURCE features are negatively correlated with the target: the higher the EXT_SOURCE value, the more likely the client is to repay the loan. We can also see that DAYS_BIRTH (here the age in days, after taking the absolute value) is positively correlated with the EXT_SOURCE features, most strongly with EXT_SOURCE_1 (0.60), which suggests that the client's age may be one of the factors behind these scores.

In [ ]:
plt.figure(figsize = (10, 12))

# iterate through the sources
for i, source in enumerate(['EXT_SOURCE_1', 'EXT_SOURCE_2', 'EXT_SOURCE_3']):
    
    # create a new subplot for each source
    plt.subplot(3, 1, i + 1)
    # plot repaid loans
    sns.kdeplot(app_train.loc[app_train['TARGET'] == 0, source], label = 'target == 0')
    # plot loans that were not repaid
    sns.kdeplot(app_train.loc[app_train['TARGET'] == 1, source], label = 'target == 1')
    
    # Label the plots
    plt.title('Distribution of %s by Target Value' % source)
    plt.xlabel('%s' % source); plt.ylabel('Density');
    
plt.tight_layout(h_pad = 2.5)

EXT_SOURCE_3 shows the largest difference between the two target values. We can clearly see that this feature has some relationship with the likelihood that an applicant repays a loan. The relationship is not very strong (all three correlations are in fact classed as very weak), but these variables will still be useful for a machine learning model that predicts whether or not an applicant will repay a loan on time.

In [ ]:
#Plot distribution of one feature
def plot_distribution(dataframe, feature, title, size):
    plt.figure(figsize=size)

    t0 = dataframe.loc[dataframe['TARGET'] == 0]
    t1 = dataframe.loc[dataframe['TARGET'] == 1]

    
    sns.kdeplot(t0[feature].dropna(), color='blue', label="TARGET = 0")
    sns.kdeplot(t1[feature].dropna(), color='red', label="TARGET = 1")
    plt.title(title)
    plt.ylabel('')
    plt.legend()
    plt.show()
In [ ]:
plot_distribution(app_train, 'AMT_CREDIT', "Credit distribution", (20,6))
print("                                   -------------------------------------------------------")
plot_distribution(app_train,'AMT_ANNUITY', "Annuity distribution", (20,6))
print("                                   -------------------------------------------------------")
plot_distribution(app_train,'AMT_GOODS_PRICE', "Goods price distribution", (20,6))
print("                                   -------------------------------------------------------")
plot_distribution(app_train,'DAYS_REGISTRATION', "Days of registration distribution", (20,6))
                                   -------------------------------------------------------
                                   -------------------------------------------------------
                                   -------------------------------------------------------

EDA: bureau.csv

This table holds all of the client's previous credits granted by other financial institutions and reported to the Credit Bureau (for clients who have a loan in our sample). For each loan in our sample, there are as many rows as the number of credits the client had in the Credit Bureau before the application date. SK_ID_CURR is the key linking the application_train | test data to the bureau data.

We need to merge "application_train" with "bureau" to collect information that may help explain TARGET == 1 for each client.
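
Because one client can have several bureau rows, the merge is one-to-many. The toy sketch below (invented values, hypothetical names toy_app and toy_bureau) shows the two consequences to keep in mind: the merged frame has one row per bureau credit, and with how='inner' clients without any bureau record disappear.

```python
import pandas as pd

# Toy stand-ins for app_train and bureau (invented values)
toy_app = pd.DataFrame({"SK_ID_CURR": [1, 2, 3], "TARGET": [0, 1, 0]})
toy_bureau = pd.DataFrame({
    "SK_ID_CURR": [1, 1, 1, 2],
    "SK_ID_BUREAU": [10, 11, 12, 20],
})

# Inner merge: one output row per matching bureau credit
merged = toy_app.merge(toy_bureau, on="SK_ID_CURR", how="inner")

# Client 3 has no bureau record and is dropped by the inner join
n_clients = merged["SK_ID_CURR"].nunique()

# Default rate per merged row versus per client
rate_per_row = merged["TARGET"].mean()
rate_per_client = toy_app["TARGET"].mean()

print(merged.shape, n_clients, rate_per_row, rate_per_client)
```

Rates computed on the merged frame are therefore weighted by the number of credits each client holds, which is worth remembering when reading the plots built on it.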

In [ ]:
app_train.shape
Out[ ]:
(307511, 124)
In [ ]:
bureau.head()
Out[ ]:
SK_ID_CURR SK_ID_BUREAU CREDIT_ACTIVE CREDIT_CURRENCY DAYS_CREDIT CREDIT_DAY_OVERDUE DAYS_CREDIT_ENDDATE DAYS_ENDDATE_FACT AMT_CREDIT_MAX_OVERDUE CNT_CREDIT_PROLONG AMT_CREDIT_SUM AMT_CREDIT_SUM_DEBT AMT_CREDIT_SUM_LIMIT AMT_CREDIT_SUM_OVERDUE CREDIT_TYPE DAYS_CREDIT_UPDATE AMT_ANNUITY
0 215354 5714462 Closed currency 1 -497 0 -153.0 -153.0 NaN 0 91323.0 0.0 NaN 0.0 Consumer credit -131 NaN
1 215354 5714463 Active currency 1 -208 0 1075.0 NaN NaN 0 225000.0 171342.0 NaN 0.0 Credit card -20 NaN
2 215354 5714464 Active currency 1 -203 0 528.0 NaN NaN 0 464323.5 NaN NaN 0.0 Consumer credit -16 NaN
3 215354 5714465 Active currency 1 -203 0 NaN NaN NaN 0 90000.0 NaN NaN 0.0 Credit card -16 NaN
4 215354 5714466 Active currency 1 -629 0 1197.0 NaN 77674.5 0 2700000.0 NaN NaN 0.0 Consumer credit -21 NaN

We again find the XNA notation here, which must be replaced with NaN (XNA is a placeholder used in place of a missing value).

In [ ]:
bureau = bureau.replace('XNA', np.nan)
In [ ]:
application_bureau_train = app_train.merge(bureau, left_on='SK_ID_CURR', right_on='SK_ID_CURR', how='inner')
application_bureau_train.shape
Out[ ]:
(1465325, 140)

Credit_active - Distribution of credit status

In [ ]:
#CREDIT_ACTIVE
plot_stat(application_bureau_train, 'CREDIT_ACTIVE',"Credit status distribution", (15,10))
print("                                   -------------------------------------------------------")
plot_percent_target1(application_bureau_train, 'CREDIT_ACTIVE',"Credit status distribution depend on Target1", (15,10))
                                   -------------------------------------------------------

Credit currency - Distribution of credit currency

In [ ]:
#CREDIT_CURRENCY
plot_stat(application_bureau_train, 'CREDIT_CURRENCY',"Credit currency distribution", (15,10))
print("                                   -------------------------------------------------------")
plot_percent_target1(application_bureau_train, 'CREDIT_CURRENCY',"Credit currency distribution depend on Target1", (15,10))
                                   -------------------------------------------------------

Credit type - Distribution of credit type

In [ ]:
#CREDIT_TYPE
plot_stat(application_bureau_train, 'CREDIT_TYPE',"Credit type distribution", (15,10))
print("                                   -------------------------------------------------------")
plot_percent_target1(application_bureau_train, 'CREDIT_TYPE',"Credit type distribution %Target1", (15,10))
                                   -------------------------------------------------------

EDA: previous_application.csv

"previous_application" contient des informations sur toutes les demandes précédentes de crédit immobilier des clients qui ont des prêts dans l'échantillon. Il y a une ligne pour chaque demande précédente liée aux prêts dans notre échantillon de données. SK_ID_CURR est la clé reliant les données application_train | test aux données previous_application.

We need to merge "application_train" with "previous_application" to collect information that may help explain TARGET == 1 for each client.

In [ ]:
app_train.shape
Out[ ]:
(307511, 124)
In [ ]:
previous_application.head()
Out[ ]:
SK_ID_PREV SK_ID_CURR NAME_CONTRACT_TYPE AMT_ANNUITY AMT_APPLICATION AMT_CREDIT AMT_DOWN_PAYMENT AMT_GOODS_PRICE WEEKDAY_APPR_PROCESS_START HOUR_APPR_PROCESS_START FLAG_LAST_APPL_PER_CONTRACT NFLAG_LAST_APPL_IN_DAY RATE_DOWN_PAYMENT RATE_INTEREST_PRIMARY RATE_INTEREST_PRIVILEGED NAME_CASH_LOAN_PURPOSE NAME_CONTRACT_STATUS DAYS_DECISION NAME_PAYMENT_TYPE CODE_REJECT_REASON NAME_TYPE_SUITE NAME_CLIENT_TYPE NAME_GOODS_CATEGORY NAME_PORTFOLIO NAME_PRODUCT_TYPE CHANNEL_TYPE SELLERPLACE_AREA NAME_SELLER_INDUSTRY CNT_PAYMENT NAME_YIELD_GROUP PRODUCT_COMBINATION DAYS_FIRST_DRAWING DAYS_FIRST_DUE DAYS_LAST_DUE_1ST_VERSION DAYS_LAST_DUE DAYS_TERMINATION NFLAG_INSURED_ON_APPROVAL
0 2030495 271877 Consumer loans 1730.430 17145.0 17145.0 0.0 17145.0 SATURDAY 15.0 Y 1.0 0.0 0.182832 0.867336 XAP Approved -73 Cash through the bank XAP NaN Repeater Mobile POS XNA Country-wide 35 Connectivity 12.0 middle POS mobile with interest 365243.0 -42.0 300.0 -42.0 -37.0 0.0
1 2802425 108129 Cash loans 25188.615 607500.0 679671.0 NaN 607500.0 THURSDAY 11.0 Y 1.0 NaN NaN NaN XNA Approved -164 XNA XAP Unaccompanied Repeater XNA Cash x-sell Contact center -1 XNA 36.0 low_action Cash X-Sell: low 365243.0 -134.0 916.0 365243.0 365243.0 1.0
2 2523466 122040 Cash loans 15060.735 112500.0 136444.5 NaN 112500.0 TUESDAY 11.0 Y 1.0 NaN NaN NaN XNA Approved -301 Cash through the bank XAP Spouse, partner Repeater XNA Cash x-sell Credit and cash offices -1 XNA 12.0 high Cash X-Sell: high 365243.0 -271.0 59.0 365243.0 365243.0 1.0
3 2819243 176158 Cash loans 47041.335 450000.0 470790.0 NaN 450000.0 MONDAY 7.0 Y 1.0 NaN NaN NaN XNA Approved -512 Cash through the bank XAP NaN Repeater XNA Cash x-sell Credit and cash offices -1 XNA 12.0 middle Cash X-Sell: middle 365243.0 -482.0 -152.0 -182.0 -177.0 1.0
4 1784265 202054 Cash loans 31924.395 337500.0 404055.0 NaN 337500.0 THURSDAY 9.0 Y 1.0 NaN NaN NaN Repairs Refused -781 Cash through the bank HC NaN Repeater XNA Cash walk-in Credit and cash offices -1 XNA 24.0 high Cash Street: high NaN NaN NaN NaN NaN NaN

We again find the XNA notation here, which must be replaced with NaN (XNA is a placeholder used in place of a missing value).

In [ ]:
previous_application = previous_application.replace('XNA', np.nan)
In [ ]:
application_prev_train = app_train.merge(previous_application, 
                                                 left_on='SK_ID_CURR', right_on='SK_ID_CURR', how='inner')
application_prev_train.shape
Out[ ]:
(1413701, 160)

Name contract type - Distribution of contract types

In [ ]:
#NAME_CONTRACT_TYPE_y
plot_stat(application_prev_train, 'NAME_CONTRACT_TYPE_y',"Contract type distribution", (15,10))
print("                                   -------------------------------------------------------")
plot_percent_target1(application_prev_train, 'NAME_CONTRACT_TYPE_y',"Contract type distribution depend on Target1", (15,10))
                                   -------------------------------------------------------

Name contract status - Distribution of contract status

In [ ]:
#NAME_CONTRACT_STATUS
plot_stat(application_prev_train, 'NAME_CONTRACT_STATUS',"Contract status distribution", (15,10))
print("                                   -------------------------------------------------------")
plot_percent_target1(application_prev_train, 'NAME_CONTRACT_STATUS',"Contract status distribution depend on Target1", (15,10))
                                   -------------------------------------------------------

Name payment type - Distribution of the payment method the client chose for the previous application

In [ ]:
#NAME_PAYMENT_TYPE
plot_stat(application_prev_train, 'NAME_PAYMENT_TYPE',"Payment type distribution", (15,10))
print("                                   -------------------------------------------------------")
plot_percent_target1(application_prev_train, 'NAME_PAYMENT_TYPE',"Payment type distribution depend on Target1", (15,10))
                                   -------------------------------------------------------

Most payments are made in cash through the bank.

The default rate does not stand out for any payment type; the rates are almost identical.

Name client type - Whether the client was a returning or a new client at the time of the previous application

In [ ]:
#NAME_CLIENT_TYPE
plot_stat(application_prev_train, 'NAME_CLIENT_TYPE',"Client type distribution", (15,10))
print("                                   -------------------------------------------------------")
plot_percent_target1(application_prev_train, 'NAME_CLIENT_TYPE',"Client type distribution depend on Target1", (15,10))
                                   -------------------------------------------------------

In the dataset, most clients are repeat applicants, but it is the new clients who have the most difficulty repaying their loans.

Creating new columns

In [ ]:
# Stack train and test rows (app_test has no TARGET column, so TARGET is NaN for those rows)
data = pd.concat([app_train, app_test])
In [ ]:
print('Train:' + str(app_train.shape))
print('Test:' + str(app_test.shape))
print('>>> Data:' + str(data.shape))
Train:(307511, 124)
Test:(48744, 123)
>>> Data:(356255, 124)

bureau: bureau.csv

In [ ]:
display(bureau.head())
display(bureau.shape)
SK_ID_CURR SK_ID_BUREAU CREDIT_ACTIVE CREDIT_CURRENCY DAYS_CREDIT CREDIT_DAY_OVERDUE DAYS_CREDIT_ENDDATE DAYS_ENDDATE_FACT AMT_CREDIT_MAX_OVERDUE CNT_CREDIT_PROLONG AMT_CREDIT_SUM AMT_CREDIT_SUM_DEBT AMT_CREDIT_SUM_LIMIT AMT_CREDIT_SUM_OVERDUE CREDIT_TYPE DAYS_CREDIT_UPDATE AMT_ANNUITY
0 215354 5714462 Closed currency 1 -497 0 -153.0 -153.0 NaN 0 91323.0 0.0 NaN 0.0 Consumer credit -131 NaN
1 215354 5714463 Active currency 1 -208 0 1075.0 NaN NaN 0 225000.0 171342.0 NaN 0.0 Credit card -20 NaN
2 215354 5714464 Active currency 1 -203 0 528.0 NaN NaN 0 464323.5 NaN NaN 0.0 Consumer credit -16 NaN
3 215354 5714465 Active currency 1 -203 0 NaN NaN NaN 0 90000.0 NaN NaN 0.0 Credit card -16 NaN
4 215354 5714466 Active currency 1 -629 0 1197.0 NaN 77674.5 0 2700000.0 NaN NaN 0.0 Consumer credit -21 NaN
(1716428, 17)

Computing the total number of previous credits for each client.

PREVIOUS_APPLICATION_COUNT: number of previous credits per client reported to the Credit Bureau (despite its name, this column is computed from bureau.csv)
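
The count-then-merge pattern can be sketched on toy frames (invented values, hypothetical names toy_data and toy_bureau). The point to notice is that a left merge leaves NaN for clients who have no bureau record at all.

```python
import pandas as pd

# Toy stand-ins for data and bureau (invented values)
toy_data = pd.DataFrame({"SK_ID_CURR": [1, 2, 3]})
toy_bureau = pd.DataFrame({
    "SK_ID_CURR": [1, 1, 2],
    "SK_ID_BUREAU": [10, 11, 20],
})

# Number of bureau rows per client
counts = (
    toy_bureau.groupby("SK_ID_CURR", as_index=False)["SK_ID_BUREAU"]
    .count()
    .rename(columns={"SK_ID_BUREAU": "PREVIOUS_APPLICATION_COUNT"})
)

# Left merge keeps every client; client 3 gets NaN because it has no bureau row
toy_data = toy_data.merge(counts, on="SK_ID_CURR", how="left")
n_missing = int(toy_data["PREVIOUS_APPLICATION_COUNT"].isna().sum())

# Optional, and an assumption of this sketch rather than the notebook's choice:
# treat "no record" as a count of zero
filled = toy_data["PREVIOUS_APPLICATION_COUNT"].fillna(0).astype(int)

print(toy_data)
print(filled.tolist())
```

Whether to keep the NaN (unknown history) or fill with 0 (no history) is a modelling decision; the two carry different meanings for a scoring model.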

In [ ]:
# Total number of previous credits recorded in bureau for each client
previous_application_counts = bureau.groupby('SK_ID_CURR', as_index=False)['SK_ID_BUREAU'].count().rename(
                                       columns = {'SK_ID_BUREAU': 'PREVIOUS_APPLICATION_COUNT'})
previous_application_counts.head()
Out[ ]:
SK_ID_CURR PREVIOUS_APPLICATION_COUNT
0 100001 7
1 100002 8
2 100003 4
3 100004 2
4 100005 3
In [ ]:
# Merge this new column into our data sample
data = data.merge(previous_application_counts, on='SK_ID_CURR', how='left')
data.shape
Out[ ]:
(356255, 125)
In [ ]:
most_credit_type = bureau[['SK_ID_CURR',
                         'CREDIT_TYPE']].copy()
most_credit_type
Out[ ]:
SK_ID_CURR CREDIT_TYPE
0 215354 Consumer credit
1 215354 Credit card
2 215354 Consumer credit
3 215354 Credit card
4 215354 Consumer credit
... ... ...
1716423 259355 Microloan
1716424 100044 Consumer credit
1716425 100044 Consumer credit
1716426 246829 Consumer credit
1716427 246829 Microloan

1716428 rows × 2 columns

In [ ]:
def mode_perso(serie_values):
    # Takes a Series and returns a single aggregated value: its most frequent entry
    count = serie_values.value_counts()
    return count.idxmax()

most_credit_type_mode = most_credit_type.groupby(by="SK_ID_CURR").agg(mode_perso)
most_credit_type_mode
Out[ ]:
CREDIT_TYPE
SK_ID_CURR
100001 Consumer credit
100002 Consumer credit
100003 Credit card
100004 Consumer credit
100005 Consumer credit
... ...
456249 Consumer credit
456250 Consumer credit
456253 Consumer credit
456254 Consumer credit
456255 Consumer credit

305811 rows × 1 columns
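
The per-client mode can be checked on a toy frame. The values are invented and the helper name most_frequent is hypothetical; it follows the same value_counts / idxmax idea as mode_perso.

```python
import pandas as pd

# Toy data, invented for illustration only
toy = pd.DataFrame({
    "SK_ID_CURR": [1, 1, 1, 2, 2, 2],
    "CREDIT_TYPE": [
        "Consumer credit", "Credit card", "Consumer credit",
        "Credit card", "Credit card", "Microloan",
    ],
})


def most_frequent(values):
    # idxmax returns the label with the highest count
    return values.value_counts().idxmax()


# One value per client: the credit type that appears most often
toy_mode = toy.groupby("SK_ID_CURR")["CREDIT_TYPE"].agg(most_frequent)

print(toy_mode)
```

When two credit types are tied for a client, the value returned depends on the order produced by value_counts, so ties may deserve a separate check.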

In [ ]:
most_credit_type_mode = most_credit_type_mode.reset_index()
most_credit_type_mode.rename(columns={
    'CREDIT_TYPE': 'MOST_CREDIT_TYPE'}, inplace=True)

left_df = data
right_df = most_credit_type_mode
data = pd.merge(left_df, right_df, on='SK_ID_CURR', how='left')
data.head()
Out[ ]:
SK_ID_CURR TARGET NAME_CONTRACT_TYPE CODE_GENDER FLAG_OWN_CAR FLAG_OWN_REALTY CNT_CHILDREN AMT_INCOME_TOTAL AMT_CREDIT AMT_ANNUITY AMT_GOODS_PRICE NAME_TYPE_SUITE NAME_INCOME_TYPE NAME_EDUCATION_TYPE NAME_FAMILY_STATUS NAME_HOUSING_TYPE REGION_POPULATION_RELATIVE DAYS_BIRTH DAYS_EMPLOYED DAYS_REGISTRATION DAYS_ID_PUBLISH OWN_CAR_AGE FLAG_MOBIL FLAG_EMP_PHONE FLAG_WORK_PHONE FLAG_CONT_MOBILE FLAG_PHONE FLAG_EMAIL OCCUPATION_TYPE CNT_FAM_MEMBERS REGION_RATING_CLIENT REGION_RATING_CLIENT_W_CITY WEEKDAY_APPR_PROCESS_START HOUR_APPR_PROCESS_START REG_REGION_NOT_LIVE_REGION REG_REGION_NOT_WORK_REGION LIVE_REGION_NOT_WORK_REGION REG_CITY_NOT_LIVE_CITY REG_CITY_NOT_WORK_CITY LIVE_CITY_NOT_WORK_CITY ORGANIZATION_TYPE EXT_SOURCE_1 EXT_SOURCE_2 EXT_SOURCE_3 APARTMENTS_AVG BASEMENTAREA_AVG YEARS_BEGINEXPLUATATION_AVG YEARS_BUILD_AVG COMMONAREA_AVG ELEVATORS_AVG ENTRANCES_AVG FLOORSMAX_AVG FLOORSMIN_AVG LANDAREA_AVG LIVINGAPARTMENTS_AVG LIVINGAREA_AVG NONLIVINGAPARTMENTS_AVG NONLIVINGAREA_AVG APARTMENTS_MODE BASEMENTAREA_MODE YEARS_BEGINEXPLUATATION_MODE YEARS_BUILD_MODE COMMONAREA_MODE ELEVATORS_MODE ENTRANCES_MODE FLOORSMAX_MODE FLOORSMIN_MODE LANDAREA_MODE LIVINGAPARTMENTS_MODE LIVINGAREA_MODE NONLIVINGAPARTMENTS_MODE NONLIVINGAREA_MODE APARTMENTS_MEDI BASEMENTAREA_MEDI YEARS_BEGINEXPLUATATION_MEDI YEARS_BUILD_MEDI COMMONAREA_MEDI ELEVATORS_MEDI ENTRANCES_MEDI FLOORSMAX_MEDI FLOORSMIN_MEDI LANDAREA_MEDI LIVINGAPARTMENTS_MEDI LIVINGAREA_MEDI NONLIVINGAPARTMENTS_MEDI NONLIVINGAREA_MEDI FONDKAPREMONT_MODE HOUSETYPE_MODE TOTALAREA_MODE WALLSMATERIAL_MODE EMERGENCYSTATE_MODE OBS_30_CNT_SOCIAL_CIRCLE DEF_30_CNT_SOCIAL_CIRCLE OBS_60_CNT_SOCIAL_CIRCLE DEF_60_CNT_SOCIAL_CIRCLE DAYS_LAST_PHONE_CHANGE FLAG_DOCUMENT_2 FLAG_DOCUMENT_3 FLAG_DOCUMENT_4 FLAG_DOCUMENT_5 FLAG_DOCUMENT_6 FLAG_DOCUMENT_7 FLAG_DOCUMENT_8 FLAG_DOCUMENT_9 FLAG_DOCUMENT_10 FLAG_DOCUMENT_11 FLAG_DOCUMENT_12 FLAG_DOCUMENT_13 FLAG_DOCUMENT_14 FLAG_DOCUMENT_15 FLAG_DOCUMENT_16 FLAG_DOCUMENT_17 FLAG_DOCUMENT_18 
FLAG_DOCUMENT_19 FLAG_DOCUMENT_20 FLAG_DOCUMENT_21 AMT_REQ_CREDIT_BUREAU_HOUR AMT_REQ_CREDIT_BUREAU_DAY AMT_REQ_CREDIT_BUREAU_WEEK AMT_REQ_CREDIT_BUREAU_MON AMT_REQ_CREDIT_BUREAU_QRT AMT_REQ_CREDIT_BUREAU_YEAR AGE DAYS_EMPLOYED_ANOM PREVIOUS_APPLICATION_COUNT MOST_CREDIT_TYPE
0 100002 1.0 Cash loans M N Y 0 202500.0 406597.5 24700.5 351000.0 Unaccompanied Working Secondary / secondary special Single / not married House / apartment 0.018801 9461 -637.0 -3648.0 -2120 NaN 1 1 0 1 1 0 Laborers 1.0 2 2 WEDNESDAY 10 0 0 0 0 0 0 Business Entity Type 3 0.083037 0.262949 0.139376 0.0247 0.0369 0.9722 0.6192 0.0143 0.00 0.0690 0.0833 0.1250 0.0369 0.0202 0.0190 0.0000 0.0000 0.0252 0.0383 0.9722 0.6341 0.0144 0.0000 0.0690 0.0833 0.1250 0.0377 0.022 0.0198 0.0 0.0 0.0250 0.0369 0.9722 0.6243 0.0144 0.00 0.0690 0.0833 0.1250 0.0375 0.0205 0.0193 0.0000 0.00 reg oper account block of flats 0.0149 Stone, brick No 2.0 2.0 2.0 2.0 -1134.0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0.0 0.0 0.0 0.0 0.0 1.0 26 False 8.0 Consumer credit
1 100003 0.0 Cash loans F N N 0 270000.0 1293502.5 35698.5 1129500.0 Family State servant Higher education Married House / apartment 0.003541 16765 -1188.0 -1186.0 -291 NaN 1 1 0 1 1 0 Core staff 2.0 1 1 MONDAY 11 0 0 0 0 0 0 School 0.311267 0.622246 NaN 0.0959 0.0529 0.9851 0.7960 0.0605 0.08 0.0345 0.2917 0.3333 0.0130 0.0773 0.0549 0.0039 0.0098 0.0924 0.0538 0.9851 0.8040 0.0497 0.0806 0.0345 0.2917 0.3333 0.0128 0.079 0.0554 0.0 0.0 0.0968 0.0529 0.9851 0.7987 0.0608 0.08 0.0345 0.2917 0.3333 0.0132 0.0787 0.0558 0.0039 0.01 reg oper account block of flats 0.0714 Block No 1.0 0.0 1.0 0.0 -828.0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0.0 0.0 0.0 0.0 0.0 0.0 46 False 4.0 Credit card
2 100004 0.0 Revolving loans M Y Y 0 67500.0 135000.0 6750.0 135000.0 Unaccompanied Working Secondary / secondary special Single / not married House / apartment 0.010032 19046 -225.0 -4260.0 -2531 26.0 1 1 1 1 1 0 Laborers 1.0 2 2 MONDAY 9 0 0 0 0 0 0 Government NaN 0.555912 0.729567 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 0.0 0.0 0.0 0.0 -815.0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0.0 0.0 0.0 0.0 0.0 0.0 52 False 2.0 Consumer credit
3 100006 0.0 Cash loans F N Y 0 135000.0 312682.5 29686.5 297000.0 Unaccompanied Working Secondary / secondary special Civil marriage House / apartment 0.008019 19005 -3039.0 -9833.0 -2437 NaN 1 1 0 1 0 0 Laborers 2.0 2 2 WEDNESDAY 17 0 0 0 0 0 0 Business Entity Type 3 NaN 0.650442 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 2.0 0.0 2.0 0.0 -617.0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 NaN NaN NaN NaN NaN NaN 52 False NaN NaN
4 100007 0.0 Cash loans M N Y 0 121500.0 513000.0 21865.5 513000.0 Unaccompanied Working Secondary / secondary special Single / not married House / apartment 0.028663 19932 -3038.0 -4311.0 -3458 NaN 1 1 0 1 0 0 Core staff 1.0 2 2 THURSDAY 11 0 0 0 0 1 1 Religion NaN 0.322738 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 0.0 0.0 0.0 0.0 -1106.0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0.0 0.0 0.0 0.0 0.0 0.0 55 False 1.0 Consumer credit

previous_application: all previous applications for Home Credit loans made by clients who have loans in our sample. There is one row for each previous application related to the loans in our data sample.

In [ ]:
display(previous_application.head())
display(previous_application.shape)
SK_ID_PREV SK_ID_CURR NAME_CONTRACT_TYPE AMT_ANNUITY AMT_APPLICATION AMT_CREDIT AMT_DOWN_PAYMENT AMT_GOODS_PRICE WEEKDAY_APPR_PROCESS_START HOUR_APPR_PROCESS_START FLAG_LAST_APPL_PER_CONTRACT NFLAG_LAST_APPL_IN_DAY RATE_DOWN_PAYMENT RATE_INTEREST_PRIMARY RATE_INTEREST_PRIVILEGED NAME_CASH_LOAN_PURPOSE NAME_CONTRACT_STATUS DAYS_DECISION NAME_PAYMENT_TYPE CODE_REJECT_REASON NAME_TYPE_SUITE NAME_CLIENT_TYPE NAME_GOODS_CATEGORY NAME_PORTFOLIO NAME_PRODUCT_TYPE CHANNEL_TYPE SELLERPLACE_AREA NAME_SELLER_INDUSTRY CNT_PAYMENT NAME_YIELD_GROUP PRODUCT_COMBINATION DAYS_FIRST_DRAWING DAYS_FIRST_DUE DAYS_LAST_DUE_1ST_VERSION DAYS_LAST_DUE DAYS_TERMINATION NFLAG_INSURED_ON_APPROVAL
0 2030495 271877 Consumer loans 1730.430 17145.0 17145.0 0.0 17145.0 SATURDAY 15.0 Y 1.0 0.0 0.182832 0.867336 XAP Approved -73 Cash through the bank XAP NaN Repeater Mobile POS NaN Country-wide 35 Connectivity 12.0 middle POS mobile with interest 365243.0 -42.0 300.0 -42.0 -37.0 0.0
1 2802425 108129 Cash loans 25188.615 607500.0 679671.0 NaN 607500.0 THURSDAY 11.0 Y 1.0 NaN NaN NaN NaN Approved -164 NaN XAP Unaccompanied Repeater NaN Cash x-sell Contact center -1 NaN 36.0 low_action Cash X-Sell: low 365243.0 -134.0 916.0 365243.0 365243.0 1.0
2 2523466 122040 Cash loans 15060.735 112500.0 136444.5 NaN 112500.0 TUESDAY 11.0 Y 1.0 NaN NaN NaN NaN Approved -301 Cash through the bank XAP Spouse, partner Repeater NaN Cash x-sell Credit and cash offices -1 NaN 12.0 high Cash X-Sell: high 365243.0 -271.0 59.0 365243.0 365243.0 1.0
3 2819243 176158 Cash loans 47041.335 450000.0 470790.0 NaN 450000.0 MONDAY 7.0 Y 1.0 NaN NaN NaN NaN Approved -512 Cash through the bank XAP NaN Repeater NaN Cash x-sell Credit and cash offices -1 NaN 12.0 middle Cash X-Sell: middle 365243.0 -482.0 -152.0 -182.0 -177.0 1.0
4 1784265 202054 Cash loans 31924.395 337500.0 404055.0 NaN 337500.0 THURSDAY 9.0 Y 1.0 NaN NaN NaN Repairs Refused -781 Cash through the bank HC NaN Repeater NaN Cash walk-in Credit and cash offices -1 NaN 24.0 high Cash Street: high NaN NaN NaN NaN NaN NaN
(1670214, 37)

PREVIOUS_LOANS_COUNT from previous_application.csv: total number of previous Home Credit applications (approved or not) recorded for each client

In [ ]:
# Number of previous applications of the clients to Home Credit
previous_loan_counts = (
    previous_application.groupby('SK_ID_CURR', as_index=False)['SK_ID_PREV']
    .count()
    .rename(columns={'SK_ID_PREV': 'PREVIOUS_LOANS_COUNT'})
)
previous_loan_counts.head()
Out[ ]:
SK_ID_CURR PREVIOUS_LOANS_COUNT
0 100001 1
1 100002 1
2 100003 3
3 100004 1
4 100005 2
In [ ]:
#Merge this new column in our data sample
data = data.merge(previous_loan_counts, on='SK_ID_CURR', how='left')
data.shape
Out[ ]:
(356255, 127)
In [ ]:
print('data shape : ', data.shape)
data shape :  (356255, 127)
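
Because the merge is a left join, a client with no row in previous_application ends up with a missing PREVIOUS_LOANS_COUNT. The sketch below illustrates this on toy data; the toy values, the names data_toy and previous_toy, and the choice to fill the gap with 0 are assumptions made for illustration, not something this notebook does.

```python
import pandas as pd

# Toy stand-ins for `data` and `previous_application` (hypothetical values)
data_toy = pd.DataFrame({"SK_ID_CURR": [1, 2, 3]})
previous_toy = pd.DataFrame({"SK_ID_CURR": [1, 1, 3], "SK_ID_PREV": [10, 11, 12]})

# Same aggregation as above: one row per client with its number of previous applications
counts = (
    previous_toy.groupby("SK_ID_CURR", as_index=False)["SK_ID_PREV"]
    .count()
    .rename(columns={"SK_ID_PREV": "PREVIOUS_LOANS_COUNT"})
)

# The left join keeps every client; client 2 has no previous application, hence NaN
merged = data_toy.merge(counts, on="SK_ID_CURR", how="left")

# Optional (assumption): read "no previous application" as a count of 0
merged["PREVIOUS_LOANS_COUNT_FILLED"] = (
    merged["PREVIOUS_LOANS_COUNT"].fillna(0).astype(int)
)
print(merged)
```

Whether 0 or a missing value is the better encoding depends on the model used afterwards, so the filled column is kept separate here.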

CREDIT_PERCENT_INCOME: the credit amount relative to the client's income (stored as a ratio, not multiplied by 100).

ANNUITY_CREDIT_PERCENT_INCOME: the loan annuity relative to the client's income (ratio).

CREDIT_REFUND_TIME: the time a client will take to repay the loan, in years of credit (the annuity being the annual amount due).

DAYS_EMPLOYED_PERCENT: the number of days employed relative to the client's age in days (ratio).

In [ ]:
# Final credit amount relative to total income
data['CREDIT_PERCENT_INCOME'] = data['AMT_CREDIT'] / data['AMT_INCOME_TOTAL']
# Loan annuity relative to total income
data['ANNUITY_CREDIT_PERCENT_INCOME'] = data['AMT_ANNUITY'] / data['AMT_INCOME_TOTAL']
# Credit repayment duration: total credit amount / amount paid per year
data['CREDIT_REFUND_TIME'] =  data['AMT_CREDIT'] / data['AMT_ANNUITY']
# Days employed relative to age in days
data['DAYS_EMPLOYED_PERCENT'] = data['DAYS_EMPLOYED'] / data['DAYS_BIRTH']
In [ ]:
print('data shape : ', data.shape)
data shape :  (356255, 131)
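
As a quick sanity check, the four ratios can be recomputed by hand for a single client. The sketch below uses the input values of the training row with SK_ID_CURR 100002 as displayed in this notebook; the one-row DataFrame named row is only an illustration.

```python
import pandas as pd

# Input values of the training row with SK_ID_CURR 100002
row = pd.DataFrame({
    "AMT_INCOME_TOTAL": [202500.0],
    "AMT_CREDIT": [406597.5],
    "AMT_ANNUITY": [24700.5],
    "DAYS_EMPLOYED": [-637.0],
    "DAYS_BIRTH": [9461],
})

# Same formulas as in the cell above
row["CREDIT_PERCENT_INCOME"] = row["AMT_CREDIT"] / row["AMT_INCOME_TOTAL"]
row["ANNUITY_CREDIT_PERCENT_INCOME"] = row["AMT_ANNUITY"] / row["AMT_INCOME_TOTAL"]
row["CREDIT_REFUND_TIME"] = row["AMT_CREDIT"] / row["AMT_ANNUITY"]
row["DAYS_EMPLOYED_PERCENT"] = row["DAYS_EMPLOYED"] / row["DAYS_BIRTH"]

# Show only the four engineered features, rounded like the notebook display
print(row.iloc[0, 5:].round(6))
```

The results match the values displayed for that client (2.007889, 0.121978, 16.461104 and -0.067329). The last ratio is negative because DAYS_EMPLOYED is negative while DAYS_BIRTH is positive in this row.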
In [ ]:
plot_distribution(data,'CREDIT_PERCENT_INCOME', "Percentage of credit amount in relation to client's income", (20,6))
print("                                   -------------------------------------------------------")
plot_distribution(data,'ANNUITY_CREDIT_PERCENT_INCOME', "Percentage of loan annuity to client income", (20,6))
print("                                   -------------------------------------------------------")
plot_distribution(data,'CREDIT_REFUND_TIME', "Credit repayment duration (AMT_CREDIT / AMT_ANNUITY)", (20,6))
print("                                   -------------------------------------------------------")
plot_distribution(data,'DAYS_EMPLOYED_PERCENT', "Percentage of days of employment in relation to client's age", (20,6))
                                   -------------------------------------------------------
                                   -------------------------------------------------------
                                   -------------------------------------------------------
In [ ]:
data = data.replace(' ', '_', regex=True)

Splitting data back into train and test, as in the original files

In [ ]:
data_train = data[data['SK_ID_CURR'].isin(app_train["SK_ID_CURR"])]
data_test = data[data['SK_ID_CURR'].isin(app_test["SK_ID_CURR"])]

data_test = data_test.drop('TARGET', axis=1)
In [ ]:
print('Training Features shape origin: ', app_train.shape)
print('Testing Features shape origin: ', app_test.shape)
Training Features shape origin:  (307511, 124)
Testing Features shape origin:  (48744, 123)
In [ ]:
print('Training Features shape after merging: ', data_train.shape)
print('Testing Features shape after merging: ', data_test.shape)
Training Features shape after merging:  (307511, 131)
Testing Features shape after merging:  (48744, 130)

Cleaning

Removing rows whose fill rate is below a threshold

In [ ]:
find_rate = data_train.copy()
# Treat literal 'nan' strings as missing values
find_rate = find_rate.replace(to_replace = '^nan$', value = np.nan, regex=True)
nb_lines = find_rate.shape[0]
nb_columns = find_rate.shape[1]

# Row fill rate = non-null cells / number of columns
find_rate['taux_remplissage_lines'] = find_rate.count(axis=1) / nb_columns

filling_rate = []
remove_line = []

# Count the rows removed for each threshold from 0% to 100%
for i in range(0, 11, 1):
    taux_remplissage = i/10.0

    filling_rate.append(taux_remplissage*100)

    df_2 = find_rate[find_rate['taux_remplissage_lines'] > taux_remplissage]

    # Number of rows removed at this threshold
    nb_lines_supp = nb_lines - df_2.shape[0]
    remove_line.append(nb_lines_supp)

find_rate = pd.DataFrame(
    {'filling_rate': filling_rate,
     'remove_lines': remove_line
    })
find_rate
Out[ ]:
filling_rate remove_lines
0 0.0 0
1 10.0 0
2 20.0 0
3 30.0 0
4 40.0 0
5 50.0 6
6 60.0 37003
7 70.0 152704
8 80.0 169492
9 90.0 212032
10 100.0 307511
In [ ]:
sns.lineplot(data=find_rate, x="filling_rate", y="remove_lines")
Out[ ]:
<AxesSubplot:xlabel='filling_rate', ylabel='remove_lines'>
In [ ]:
def filtration_line(dataframe, taux_remplissage):
    """Keep the rows whose fill rate is strictly above taux_remplissage (between 0 and 1)."""
    df = dataframe.copy()
    # Treat literal 'nan' strings as missing values
    dataframe = dataframe.replace(to_replace = '^nan$', value = np.nan, regex=True)
    # Number of rows and columns at origin
    nb_lines = dataframe.shape[0]
    nb_columns = dataframe.shape[1]

    # Row fill rate = non-null cells / number of columns of this dataframe
    df['taux_remplissage_lines'] = dataframe.count(axis=1) / nb_columns

    df_2 = df[df['taux_remplissage_lines'] > taux_remplissage]

    # Number of rows removed
    nb_lines_supp = nb_lines - df_2.shape[0]

    print("Number of lines with a fill rate higher than {:.2%} : {} lines.".format(taux_remplissage, df_2.shape[0]))
    print("Number of lines deleted : {} lines".format(nb_lines_supp))
    print(df_2.shape)

    # Drop the helper column before returning
    return df_2.drop(columns='taux_remplissage_lines')
In [ ]:
app_train_clean_lines = filtration_line(data_train, 0.7)
Number of lines with a fill rate higher than 70.00% : 154807 lines.
Number of lines deleted : 152704 lines
(154807, 132)
In [ ]:
app_train_clean_lines.shape
Out[ ]:
(154807, 131)
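
The row filter above boils down to computing, for each row, the share of non-missing cells and keeping the rows above the threshold. A minimal sketch on a toy frame follows; the frame toy, its column names and the 0.7 threshold are illustrative assumptions.

```python
import numpy as np
import pandas as pd

# Toy frame with an increasing number of missing cells per row
toy = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, np.nan],
    "b": [1.0, 2.0, np.nan, np.nan],
    "c": ["x", "y", "z", None],
    "d": [1, 2, 3, 4],
})

# Share of non-missing cells in each row
row_fill_rate = toy.notna().mean(axis=1)

# Keep rows strictly above the threshold, as filtration_line does
threshold = 0.7
kept = toy[row_fill_rate > threshold]

print(row_fill_rate.tolist())  # [1.0, 0.75, 0.75, 0.25]
print(kept.index.tolist())     # [0, 1, 2]
```

The original index is preserved after filtering, which is why the filtered training frame shows row labels such as 0, 1, 12, 13, 14.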
In [ ]:
app_train_clean_lines.head()
Out[ ]:
SK_ID_CURR TARGET NAME_CONTRACT_TYPE CODE_GENDER FLAG_OWN_CAR FLAG_OWN_REALTY CNT_CHILDREN AMT_INCOME_TOTAL AMT_CREDIT AMT_ANNUITY AMT_GOODS_PRICE NAME_TYPE_SUITE NAME_INCOME_TYPE NAME_EDUCATION_TYPE NAME_FAMILY_STATUS NAME_HOUSING_TYPE REGION_POPULATION_RELATIVE DAYS_BIRTH DAYS_EMPLOYED DAYS_REGISTRATION DAYS_ID_PUBLISH OWN_CAR_AGE FLAG_MOBIL FLAG_EMP_PHONE FLAG_WORK_PHONE FLAG_CONT_MOBILE FLAG_PHONE FLAG_EMAIL OCCUPATION_TYPE CNT_FAM_MEMBERS REGION_RATING_CLIENT REGION_RATING_CLIENT_W_CITY WEEKDAY_APPR_PROCESS_START HOUR_APPR_PROCESS_START REG_REGION_NOT_LIVE_REGION REG_REGION_NOT_WORK_REGION LIVE_REGION_NOT_WORK_REGION REG_CITY_NOT_LIVE_CITY REG_CITY_NOT_WORK_CITY LIVE_CITY_NOT_WORK_CITY ORGANIZATION_TYPE EXT_SOURCE_1 EXT_SOURCE_2 EXT_SOURCE_3 APARTMENTS_AVG BASEMENTAREA_AVG YEARS_BEGINEXPLUATATION_AVG YEARS_BUILD_AVG COMMONAREA_AVG ELEVATORS_AVG ENTRANCES_AVG FLOORSMAX_AVG FLOORSMIN_AVG LANDAREA_AVG LIVINGAPARTMENTS_AVG LIVINGAREA_AVG NONLIVINGAPARTMENTS_AVG NONLIVINGAREA_AVG APARTMENTS_MODE BASEMENTAREA_MODE YEARS_BEGINEXPLUATATION_MODE YEARS_BUILD_MODE COMMONAREA_MODE ELEVATORS_MODE ENTRANCES_MODE FLOORSMAX_MODE FLOORSMIN_MODE LANDAREA_MODE LIVINGAPARTMENTS_MODE LIVINGAREA_MODE NONLIVINGAPARTMENTS_MODE NONLIVINGAREA_MODE APARTMENTS_MEDI BASEMENTAREA_MEDI YEARS_BEGINEXPLUATATION_MEDI YEARS_BUILD_MEDI COMMONAREA_MEDI ELEVATORS_MEDI ENTRANCES_MEDI FLOORSMAX_MEDI FLOORSMIN_MEDI LANDAREA_MEDI LIVINGAPARTMENTS_MEDI LIVINGAREA_MEDI NONLIVINGAPARTMENTS_MEDI NONLIVINGAREA_MEDI FONDKAPREMONT_MODE HOUSETYPE_MODE TOTALAREA_MODE WALLSMATERIAL_MODE EMERGENCYSTATE_MODE OBS_30_CNT_SOCIAL_CIRCLE DEF_30_CNT_SOCIAL_CIRCLE OBS_60_CNT_SOCIAL_CIRCLE DEF_60_CNT_SOCIAL_CIRCLE DAYS_LAST_PHONE_CHANGE FLAG_DOCUMENT_2 FLAG_DOCUMENT_3 FLAG_DOCUMENT_4 FLAG_DOCUMENT_5 FLAG_DOCUMENT_6 FLAG_DOCUMENT_7 FLAG_DOCUMENT_8 FLAG_DOCUMENT_9 FLAG_DOCUMENT_10 FLAG_DOCUMENT_11 FLAG_DOCUMENT_12 FLAG_DOCUMENT_13 FLAG_DOCUMENT_14 FLAG_DOCUMENT_15 FLAG_DOCUMENT_16 FLAG_DOCUMENT_17 FLAG_DOCUMENT_18 
FLAG_DOCUMENT_19 FLAG_DOCUMENT_20 FLAG_DOCUMENT_21 AMT_REQ_CREDIT_BUREAU_HOUR AMT_REQ_CREDIT_BUREAU_DAY AMT_REQ_CREDIT_BUREAU_WEEK AMT_REQ_CREDIT_BUREAU_MON AMT_REQ_CREDIT_BUREAU_QRT AMT_REQ_CREDIT_BUREAU_YEAR AGE DAYS_EMPLOYED_ANOM PREVIOUS_APPLICATION_COUNT MOST_CREDIT_TYPE PREVIOUS_LOANS_COUNT CREDIT_PERCENT_INCOME ANNUITY_CREDIT_PERCENT_INCOME CREDIT_REFUND_TIME DAYS_EMPLOYED_PERCENT
0 100002 1.0 Cash_loans M N Y 0 202500.0 406597.5 24700.5 351000.0 Unaccompanied Working Secondary_/_secondary_special Single_/_not_married House_/_apartment 0.018801 9461 -637.0 -3648.0 -2120 NaN 1 1 0 1 1 0 Laborers 1.0 2 2 WEDNESDAY 10 0 0 0 0 0 0 Business_Entity_Type_3 0.083037 0.262949 0.139376 0.0247 0.0369 0.9722 0.6192 0.0143 0.00 0.0690 0.0833 0.1250 0.0369 0.0202 0.0190 0.0000 0.0000 0.0252 0.0383 0.9722 0.6341 0.0144 0.0000 0.0690 0.0833 0.1250 0.0377 0.0220 0.0198 0.0000 0.000 0.0250 0.0369 0.9722 0.6243 0.0144 0.00 0.0690 0.0833 0.1250 0.0375 0.0205 0.0193 0.0000 0.0000 reg_oper_account block_of_flats 0.0149 Stone,_brick No 2.0 2.0 2.0 2.0 -1134.0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0.0 0.0 0.0 0.0 0.0 1.0 26 False 8.0 Consumer_credit 1.0 2.007889 0.121978 16.461104 -0.067329
1 100003 0.0 Cash_loans F N N 0 270000.0 1293502.5 35698.5 1129500.0 Family State_servant Higher_education Married House_/_apartment 0.003541 16765 -1188.0 -1186.0 -291 NaN 1 1 0 1 1 0 Core_staff 2.0 1 1 MONDAY 11 0 0 0 0 0 0 School 0.311267 0.622246 NaN 0.0959 0.0529 0.9851 0.7960 0.0605 0.08 0.0345 0.2917 0.3333 0.0130 0.0773 0.0549 0.0039 0.0098 0.0924 0.0538 0.9851 0.8040 0.0497 0.0806 0.0345 0.2917 0.3333 0.0128 0.0790 0.0554 0.0000 0.000 0.0968 0.0529 0.9851 0.7987 0.0608 0.08 0.0345 0.2917 0.3333 0.0132 0.0787 0.0558 0.0039 0.0100 reg_oper_account block_of_flats 0.0714 Block No 1.0 0.0 1.0 0.0 -828.0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0.0 0.0 0.0 0.0 0.0 0.0 46 False 4.0 Credit_card 3.0 4.790750 0.132217 36.234085 -0.070862
12 100016 0.0 Cash_loans F N Y 0 67500.0 80865.0 5881.5 67500.0 Unaccompanied Working Secondary_/_secondary_special Married House_/_apartment 0.031329 13439 -2717.0 -311.0 -3227 NaN 1 1 1 1 1 0 Laborers 2.0 2 2 FRIDAY 10 0 0 0 0 0 0 Business_Entity_Type_2 0.464831 0.715042 0.176653 0.0825 NaN 0.9811 NaN NaN 0.00 0.2069 0.1667 NaN 0.0135 NaN 0.0778 NaN 0.0000 0.0840 NaN 0.9811 NaN NaN 0.0000 0.2069 0.1667 NaN 0.0138 NaN 0.0810 NaN 0.000 0.0833 NaN 0.9811 NaN NaN 0.00 0.2069 0.1667 NaN 0.0137 NaN 0.0792 NaN 0.0000 reg_oper_account block_of_flats 0.0612 NaN No 0.0 0.0 0.0 0.0 -2370.0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0.0 0.0 0.0 1.0 0.0 0.0 37 False 7.0 Consumer_credit 4.0 1.198000 0.087133 13.749044 -0.202173
13 100017 0.0 Cash_loans M Y N 1 225000.0 918468.0 28966.5 697500.0 Unaccompanied Working Secondary_/_secondary_special Married House_/_apartment 0.016612 14086 -3028.0 -643.0 -4911 23.0 1 1 0 1 0 0 Drivers 3.0 2 2 THURSDAY 13 0 0 0 0 0 0 Self-employed NaN 0.566907 0.770087 0.1474 0.0973 0.9806 0.7348 0.0582 0.16 0.1379 0.3333 0.3750 0.0931 0.1202 0.1397 0.0000 0.0000 0.1502 0.1010 0.9806 0.7452 0.0587 0.1611 0.1379 0.3333 0.3750 0.0952 0.1313 0.1456 0.0000 0.000 0.1489 0.0973 0.9806 0.7383 0.0585 0.16 0.1379 0.3333 0.3750 0.0947 0.1223 0.1422 0.0000 0.0000 reg_oper_account block_of_flats 0.1417 Panel No 0.0 0.0 0.0 0.0 -4.0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0.0 0.0 0.0 0.0 0.0 1.0 39 False 6.0 Consumer_credit 2.0 4.082080 0.128740 31.707938 -0.214965
14 100018 0.0 Cash_loans F N Y 0 189000.0 773680.5 32778.0 679500.0 Unaccompanied Working Secondary_/_secondary_special Married House_/_apartment 0.010006 14583 -203.0 -615.0 -2056 NaN 1 1 0 1 0 0 Laborers 2.0 2 1 MONDAY 9 0 0 0 0 0 0 Transport:_type_2 0.721940 0.642656 NaN 0.3495 0.1335 0.9985 0.9796 0.1143 0.40 0.1724 0.6667 0.7083 0.1758 0.2849 0.3774 0.0193 0.1001 0.3561 0.1386 0.9985 0.9804 0.1153 0.4028 0.1724 0.6667 0.7083 0.1798 0.3113 0.3932 0.0195 0.106 0.3529 0.1335 0.9985 0.9799 0.1150 0.40 0.1724 0.6667 0.7083 0.1789 0.2899 0.3842 0.0194 0.1022 reg_oper_account block_of_flats 0.3811 Panel No 0.0 0.0 0.0 0.0 -188.0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 NaN NaN NaN NaN NaN NaN 40 False NaN NaN 4.0 4.093548 0.173429 23.603652 -0.013920
In [ ]:
app_test_clean_lines = filtration_line(data_test, 0.7)
Number of lines with a fill rate higher than 70.00% : 25345 lines.
Number of lines deleted : 23399 lines
(25345, 131)

Removing columns whose fill rate is below a threshold

In [ ]:
find_rate = app_train_clean_lines.copy()
find_rate = find_rate.replace(to_replace = '^nan$', value = np.nan, regex=True)
nb_colonne = find_rate.shape[1]

filling_rate = []
remove_col = []

for i in range(0, 11, 1):
    taux_remplissage = i/10.0
    
    filling_rate.append(taux_remplissage*100)

    df = find_rate[find_rate.columns[1-find_rate.isnull().mean() > taux_remplissage]]
    
    # Number of columns removed at this threshold
    nb_colonne_supp = nb_colonne - df.shape[1]
    
    remove_col.append(nb_colonne_supp)
  
    
find_rate = pd.DataFrame(
    {'filling_rate': filling_rate,
     'remove_columns': remove_col
    })
find_rate  
    
Out[ ]:
filling_rate remove_columns
0 0.0 0
1 10.0 0
2 20.0 0
3 30.0 0
4 40.0 1
5 50.0 2
6 60.0 5
7 70.0 19
8 80.0 19
9 90.0 40
10 100.0 131
In [ ]:
sns.lineplot(data=find_rate, x="filling_rate", y="remove_columns")
Out[ ]:
<AxesSubplot:xlabel='filling_rate', ylabel='remove_columns'>
In [ ]:
def filtration_columns(dataframe, taux_remplissage):
    """Keep the columns whose fill rate is strictly above taux_remplissage (between 0 and 1)."""
    # Treat literal 'nan' strings as missing values
    dataframe = dataframe.replace(to_replace = '^nan$', value = np.nan, regex=True)
    # Number of columns at origin
    nb_colonne = dataframe.shape[1]
    
    df = dataframe[dataframe.columns[1-dataframe.isnull().mean() > taux_remplissage]]

    # Number of columns removed
    nb_colonne_supp = nb_colonne - df.shape[1]

    print("Number of columns with a fill rate higher than {:.2%} : {} columns.".format(taux_remplissage, df.shape[1]))
    print("Number of columns deleted : {} columns".format(nb_colonne_supp))

    return df
In [ ]:
app_train_reduced = filtration_columns(app_train_clean_lines, 0.8)
Number of columns with a fill rate higher than 80.00% : 112 columns.
Number of columns deleted : 19 columns
In [ ]:
app_train_reduced.head()
Out[ ]:
SK_ID_CURR TARGET NAME_CONTRACT_TYPE CODE_GENDER FLAG_OWN_CAR FLAG_OWN_REALTY CNT_CHILDREN AMT_INCOME_TOTAL AMT_CREDIT AMT_ANNUITY AMT_GOODS_PRICE NAME_TYPE_SUITE NAME_INCOME_TYPE NAME_EDUCATION_TYPE NAME_FAMILY_STATUS NAME_HOUSING_TYPE REGION_POPULATION_RELATIVE DAYS_BIRTH DAYS_EMPLOYED DAYS_REGISTRATION DAYS_ID_PUBLISH FLAG_MOBIL FLAG_EMP_PHONE FLAG_WORK_PHONE FLAG_CONT_MOBILE FLAG_PHONE FLAG_EMAIL CNT_FAM_MEMBERS REGION_RATING_CLIENT REGION_RATING_CLIENT_W_CITY WEEKDAY_APPR_PROCESS_START HOUR_APPR_PROCESS_START REG_REGION_NOT_LIVE_REGION REG_REGION_NOT_WORK_REGION LIVE_REGION_NOT_WORK_REGION REG_CITY_NOT_LIVE_CITY REG_CITY_NOT_WORK_CITY LIVE_CITY_NOT_WORK_CITY ORGANIZATION_TYPE EXT_SOURCE_2 EXT_SOURCE_3 APARTMENTS_AVG BASEMENTAREA_AVG YEARS_BEGINEXPLUATATION_AVG ELEVATORS_AVG ENTRANCES_AVG FLOORSMAX_AVG LANDAREA_AVG LIVINGAREA_AVG NONLIVINGAREA_AVG APARTMENTS_MODE BASEMENTAREA_MODE YEARS_BEGINEXPLUATATION_MODE ELEVATORS_MODE ENTRANCES_MODE FLOORSMAX_MODE LANDAREA_MODE LIVINGAREA_MODE NONLIVINGAREA_MODE APARTMENTS_MEDI BASEMENTAREA_MEDI YEARS_BEGINEXPLUATATION_MEDI ELEVATORS_MEDI ENTRANCES_MEDI FLOORSMAX_MEDI LANDAREA_MEDI LIVINGAREA_MEDI NONLIVINGAREA_MEDI HOUSETYPE_MODE TOTALAREA_MODE WALLSMATERIAL_MODE EMERGENCYSTATE_MODE OBS_30_CNT_SOCIAL_CIRCLE DEF_30_CNT_SOCIAL_CIRCLE OBS_60_CNT_SOCIAL_CIRCLE DEF_60_CNT_SOCIAL_CIRCLE DAYS_LAST_PHONE_CHANGE FLAG_DOCUMENT_2 FLAG_DOCUMENT_3 FLAG_DOCUMENT_4 FLAG_DOCUMENT_5 FLAG_DOCUMENT_6 FLAG_DOCUMENT_7 FLAG_DOCUMENT_8 FLAG_DOCUMENT_9 FLAG_DOCUMENT_10 FLAG_DOCUMENT_11 FLAG_DOCUMENT_12 FLAG_DOCUMENT_13 FLAG_DOCUMENT_14 FLAG_DOCUMENT_15 FLAG_DOCUMENT_16 FLAG_DOCUMENT_17 FLAG_DOCUMENT_18 FLAG_DOCUMENT_19 FLAG_DOCUMENT_20 FLAG_DOCUMENT_21 AMT_REQ_CREDIT_BUREAU_HOUR AMT_REQ_CREDIT_BUREAU_DAY AMT_REQ_CREDIT_BUREAU_WEEK AMT_REQ_CREDIT_BUREAU_MON AMT_REQ_CREDIT_BUREAU_QRT AMT_REQ_CREDIT_BUREAU_YEAR AGE DAYS_EMPLOYED_ANOM PREVIOUS_APPLICATION_COUNT MOST_CREDIT_TYPE PREVIOUS_LOANS_COUNT CREDIT_PERCENT_INCOME 
ANNUITY_CREDIT_PERCENT_INCOME CREDIT_REFUND_TIME DAYS_EMPLOYED_PERCENT
0 100002 1.0 Cash_loans M N Y 0 202500.0 406597.5 24700.5 351000.0 Unaccompanied Working Secondary_/_secondary_special Single_/_not_married House_/_apartment 0.018801 9461 -637.0 -3648.0 -2120 1 1 0 1 1 0 1.0 2 2 WEDNESDAY 10 0 0 0 0 0 0 Business_Entity_Type_3 0.262949 0.139376 0.0247 0.0369 0.9722 0.00 0.0690 0.0833 0.0369 0.0190 0.0000 0.0252 0.0383 0.9722 0.0000 0.0690 0.0833 0.0377 0.0198 0.000 0.0250 0.0369 0.9722 0.00 0.0690 0.0833 0.0375 0.0193 0.0000 block_of_flats 0.0149 Stone,_brick No 2.0 2.0 2.0 2.0 -1134.0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0.0 0.0 0.0 0.0 0.0 1.0 26 False 8.0 Consumer_credit 1.0 2.007889 0.121978 16.461104 -0.067329
1 100003 0.0 Cash_loans F N N 0 270000.0 1293502.5 35698.5 1129500.0 Family State_servant Higher_education Married House_/_apartment 0.003541 16765 -1188.0 -1186.0 -291 1 1 0 1 1 0 2.0 1 1 MONDAY 11 0 0 0 0 0 0 School 0.622246 NaN 0.0959 0.0529 0.9851 0.08 0.0345 0.2917 0.0130 0.0549 0.0098 0.0924 0.0538 0.9851 0.0806 0.0345 0.2917 0.0128 0.0554 0.000 0.0968 0.0529 0.9851 0.08 0.0345 0.2917 0.0132 0.0558 0.0100 block_of_flats 0.0714 Block No 1.0 0.0 1.0 0.0 -828.0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0.0 0.0 0.0 0.0 0.0 0.0 46 False 4.0 Credit_card 3.0 4.790750 0.132217 36.234085 -0.070862
12 100016 0.0 Cash_loans F N Y 0 67500.0 80865.0 5881.5 67500.0 Unaccompanied Working Secondary_/_secondary_special Married House_/_apartment 0.031329 13439 -2717.0 -311.0 -3227 1 1 1 1 1 0 2.0 2 2 FRIDAY 10 0 0 0 0 0 0 Business_Entity_Type_2 0.715042 0.176653 0.0825 NaN 0.9811 0.00 0.2069 0.1667 0.0135 0.0778 0.0000 0.0840 NaN 0.9811 0.0000 0.2069 0.1667 0.0138 0.0810 0.000 0.0833 NaN 0.9811 0.00 0.2069 0.1667 0.0137 0.0792 0.0000 block_of_flats 0.0612 NaN No 0.0 0.0 0.0 0.0 -2370.0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0.0 0.0 0.0 1.0 0.0 0.0 37 False 7.0 Consumer_credit 4.0 1.198000 0.087133 13.749044 -0.202173
13 100017 0.0 Cash_loans M Y N 1 225000.0 918468.0 28966.5 697500.0 Unaccompanied Working Secondary_/_secondary_special Married House_/_apartment 0.016612 14086 -3028.0 -643.0 -4911 1 1 0 1 0 0 3.0 2 2 THURSDAY 13 0 0 0 0 0 0 Self-employed 0.566907 0.770087 0.1474 0.0973 0.9806 0.16 0.1379 0.3333 0.0931 0.1397 0.0000 0.1502 0.1010 0.9806 0.1611 0.1379 0.3333 0.0952 0.1456 0.000 0.1489 0.0973 0.9806 0.16 0.1379 0.3333 0.0947 0.1422 0.0000 block_of_flats 0.1417 Panel No 0.0 0.0 0.0 0.0 -4.0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0.0 0.0 0.0 0.0 0.0 1.0 39 False 6.0 Consumer_credit 2.0 4.082080 0.128740 31.707938 -0.214965
14 100018 0.0 Cash_loans F N Y 0 189000.0 773680.5 32778.0 679500.0 Unaccompanied Working Secondary_/_secondary_special Married House_/_apartment 0.010006 14583 -203.0 -615.0 -2056 1 1 0 1 0 0 2.0 2 1 MONDAY 9 0 0 0 0 0 0 Transport:_type_2 0.642656 NaN 0.3495 0.1335 0.9985 0.40 0.1724 0.6667 0.1758 0.3774 0.1001 0.3561 0.1386 0.9985 0.4028 0.1724 0.6667 0.1798 0.3932 0.106 0.3529 0.1335 0.9985 0.40 0.1724 0.6667 0.1789 0.3842 0.1022 block_of_flats 0.3811 Panel No 0.0 0.0 0.0 0.0 -188.0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 NaN NaN NaN NaN NaN NaN 40 False NaN NaN 4.0 4.093548 0.173429 23.603652 -0.013920
In [ ]:
app_train_reduced.shape
Out[ ]:
(154807, 112)
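
The column filter follows the same idea as the row filter, applied along the other axis: compute the share of non-missing values per column and keep the columns above the threshold. A minimal sketch on toy data, where the frame toy, its column names and the 0.7 threshold are illustrative assumptions:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "full": [1, 2, 3, 4, 5],                        # 100% filled
    "mostly": [1.0, 2.0, 3.0, 4.0, np.nan],         # 80% filled
    "sparse": [np.nan, np.nan, np.nan, 4.0, 5.0],   # 40% filled
})

# Share of non-missing values in each column
col_fill_rate = toy.notna().mean()

# Keep columns strictly above the threshold
threshold = 0.7
kept = toy.loc[:, col_fill_rate > threshold]

print(col_fill_rate.round(2).to_dict())
print(list(kept.columns))  # ['full', 'mostly']
```

`toy.notna().mean()` is the same quantity as `1 - toy.isnull().mean()` used in filtration_columns.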
In [ ]:
# Make sure the key features are kept even if the column filter dropped them
key_columns = ['EXT_SOURCE_1', 'EXT_SOURCE_2', 'EXT_SOURCE_3',
               'DAYS_EMPLOYED', 'DAYS_BIRTH', 'AGE']

for column in key_columns:
    if column in app_train_reduced.columns:
        print("The column {} is in the dataset.".format(column))
    else:
        app_train_reduced[column] = app_train_clean_lines[column]
        print("The column {} has been added to the dataset.".format(column))
The column EXT_SOURCE_1 has been added to the dataset.
The column EXT_SOURCE_2 is in the dataset.
The column EXT_SOURCE_3 is in the dataset.
The column DAYS_EMPLOYED is in the dataset.
The column DAYS_BIRTH is in the dataset.
The column AGE is in the dataset.
In [ ]:
app_train_reduced.shape
Out[ ]:
(154807, 113)
In [ ]:
filter_columns = list(app_train_reduced.columns)
In [ ]:
def remove_columns(dataframe, filter_columns):
    """Keep only the columns listed in filter_columns.

    dataframe : dataframe to filter
    filter_columns : columns to keep
    """
    new = pd.DataFrame()

    for column in filter_columns:
        try:
            new[column] = dataframe[column]
        except KeyError:
            # The column was selected on the training set but is absent here
            print('...column not present: ', column)
            print('\n')
    print("All selected columns have been kept from the dataset")
    return new
In [ ]:
app_test_reduced = remove_columns(app_test_clean_lines, filter_columns)
...column not present:  TARGET


All selected columns have been kept from the dataset
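
The goal of this step is to give the test set the columns selected on the training set, in the same order, while skipping those that only exist in training (here TARGET). A possible alternative to the loop, sketched on toy frames; train_toy, test_toy and the EXTRA column are hypothetical names used only for illustration:

```python
import pandas as pd

# Toy frames: TARGET exists only in train, EXTRA only in test
train_toy = pd.DataFrame({"SK_ID_CURR": [1], "TARGET": [0], "AGE": [30]})
test_toy = pd.DataFrame({"SK_ID_CURR": [2], "AGE": [41], "EXTRA": [9]})

# Columns selected on the training set, in training order
filter_columns_toy = list(train_toy.columns)

# Split them into those available in the test set and those missing from it
shared = [c for c in filter_columns_toy if c in test_toy.columns]
missing = [c for c in filter_columns_toy if c not in test_toy.columns]

# Select the shared columns in one indexing operation
test_aligned = test_toy[shared]

print("missing in test:", missing)
print(list(test_aligned.columns))
```

Selecting with a list of names keeps the training column order and drops test-only columns such as EXTRA in a single step.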
In [ ]:
app_test_reduced
Out[ ]:
SK_ID_CURR NAME_CONTRACT_TYPE CODE_GENDER FLAG_OWN_CAR FLAG_OWN_REALTY CNT_CHILDREN AMT_INCOME_TOTAL AMT_CREDIT AMT_ANNUITY AMT_GOODS_PRICE NAME_TYPE_SUITE NAME_INCOME_TYPE NAME_EDUCATION_TYPE NAME_FAMILY_STATUS NAME_HOUSING_TYPE REGION_POPULATION_RELATIVE DAYS_BIRTH DAYS_EMPLOYED DAYS_REGISTRATION DAYS_ID_PUBLISH FLAG_MOBIL FLAG_EMP_PHONE FLAG_WORK_PHONE FLAG_CONT_MOBILE FLAG_PHONE FLAG_EMAIL CNT_FAM_MEMBERS REGION_RATING_CLIENT REGION_RATING_CLIENT_W_CITY WEEKDAY_APPR_PROCESS_START HOUR_APPR_PROCESS_START REG_REGION_NOT_LIVE_REGION REG_REGION_NOT_WORK_REGION LIVE_REGION_NOT_WORK_REGION REG_CITY_NOT_LIVE_CITY REG_CITY_NOT_WORK_CITY LIVE_CITY_NOT_WORK_CITY ORGANIZATION_TYPE EXT_SOURCE_2 EXT_SOURCE_3 APARTMENTS_AVG BASEMENTAREA_AVG YEARS_BEGINEXPLUATATION_AVG ELEVATORS_AVG ENTRANCES_AVG FLOORSMAX_AVG LANDAREA_AVG LIVINGAREA_AVG NONLIVINGAREA_AVG APARTMENTS_MODE BASEMENTAREA_MODE YEARS_BEGINEXPLUATATION_MODE ELEVATORS_MODE ENTRANCES_MODE FLOORSMAX_MODE LANDAREA_MODE LIVINGAREA_MODE NONLIVINGAREA_MODE APARTMENTS_MEDI BASEMENTAREA_MEDI YEARS_BEGINEXPLUATATION_MEDI ELEVATORS_MEDI ENTRANCES_MEDI FLOORSMAX_MEDI LANDAREA_MEDI LIVINGAREA_MEDI NONLIVINGAREA_MEDI HOUSETYPE_MODE TOTALAREA_MODE WALLSMATERIAL_MODE EMERGENCYSTATE_MODE OBS_30_CNT_SOCIAL_CIRCLE DEF_30_CNT_SOCIAL_CIRCLE OBS_60_CNT_SOCIAL_CIRCLE DEF_60_CNT_SOCIAL_CIRCLE DAYS_LAST_PHONE_CHANGE FLAG_DOCUMENT_2 FLAG_DOCUMENT_3 FLAG_DOCUMENT_4 FLAG_DOCUMENT_5 FLAG_DOCUMENT_6 FLAG_DOCUMENT_7 FLAG_DOCUMENT_8 FLAG_DOCUMENT_9 FLAG_DOCUMENT_10 FLAG_DOCUMENT_11 FLAG_DOCUMENT_12 FLAG_DOCUMENT_13 FLAG_DOCUMENT_14 FLAG_DOCUMENT_15 FLAG_DOCUMENT_16 FLAG_DOCUMENT_17 FLAG_DOCUMENT_18 FLAG_DOCUMENT_19 FLAG_DOCUMENT_20 FLAG_DOCUMENT_21 AMT_REQ_CREDIT_BUREAU_HOUR AMT_REQ_CREDIT_BUREAU_DAY AMT_REQ_CREDIT_BUREAU_WEEK AMT_REQ_CREDIT_BUREAU_MON AMT_REQ_CREDIT_BUREAU_QRT AMT_REQ_CREDIT_BUREAU_YEAR AGE DAYS_EMPLOYED_ANOM PREVIOUS_APPLICATION_COUNT MOST_CREDIT_TYPE PREVIOUS_LOANS_COUNT CREDIT_PERCENT_INCOME ANNUITY_CREDIT_PERCENT_INCOME 
CREDIT_REFUND_TIME DAYS_EMPLOYED_PERCENT EXT_SOURCE_1
307511 100001 Cash_loans F N Y 0 135000.0 568800.0 20560.5 450000.0 Unaccompanied Working Higher_education Married House_/_apartment 0.018850 -19241 -2329.0 -5170.0 -812 1 1 0 1 0 1 2.0 2 2 TUESDAY 18 0 0 0 0 0 0 Kindergarten 0.789654 0.159520 0.0660 0.0590 0.9732 NaN 0.1379 0.1250 NaN 0.0505 NaN 0.0672 0.0612 0.9732 NaN 0.1379 0.1250 NaN 0.0526 NaN 0.0666 0.0590 0.9732 NaN 0.1379 0.1250 NaN 0.0514 NaN block_of_flats 0.0392 Stone,_brick No 0.0 0.0 0.0 0.0 -1740.0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0.0 0.0 0.0 0.0 0.0 0.0 53 False 7.0 Consumer_credit 1.0 4.213333 0.152300 27.664697 0.121044 0.752614
307514 100028 Cash_loans F N Y 2 315000.0 1575000.0 49018.5 1575000.0 Unaccompanied Working Secondary_/_secondary_special Married House_/_apartment 0.026392 -13976 -1866.0 -2000.0 -4208 1 1 0 1 1 0 4.0 2 2 WEDNESDAY 11 0 0 0 0 0 0 Business_Entity_Type_3 0.509677 0.612704 0.3052 0.1974 0.9970 0.32 0.2759 0.3750 0.2042 0.3673 0.0800 0.3109 0.2049 0.9970 0.3222 0.2759 0.3750 0.2089 0.3827 0.0847 0.3081 0.1974 0.9970 0.32 0.2759 0.3750 0.2078 0.3739 0.0817 block_of_flats 0.3700 Panel No 0.0 0.0 0.0 0.0 -1805.0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0.0 0.0 0.0 0.0 0.0 3.0 38 False 12.0 Consumer_credit 5.0 5.000000 0.155614 32.130726 0.133515 0.525734
307516 100042 Cash_loans F Y Y 0 270000.0 959688.0 34600.5 810000.0 Unaccompanied State_servant Secondary_/_secondary_special Married House_/_apartment 0.025164 -18604 -12009.0 -6116.0 -2027 1 1 0 1 1 0 2.0 2 2 MONDAY 15 0 0 0 0 0 0 Government 0.628904 0.392774 0.2412 0.0084 0.9821 0.16 0.1379 0.3333 0.1683 0.2218 0.0731 0.2458 0.0088 0.9821 0.1611 0.1379 0.3333 0.1721 0.2311 0.0774 0.2436 0.0084 0.9821 0.16 0.1379 0.3333 0.1712 0.2258 0.0746 block_of_flats 0.2151 Block No 0.0 0.0 0.0 0.0 -1705.0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0.0 0.0 0.0 0.0 1.0 2.0 51 False 14.0 Consumer_credit 9.0 3.554400 0.128150 27.736247 0.645506 NaN
307519 100066 Cash_loans F N Y 0 315000.0 364896.0 28957.5 315000.0 Unaccompanied State_servant Higher_education Married House_/_apartment 0.046220 -12744 -1013.0 -1686.0 -3171 1 1 0 1 0 0 2.0 1 1 THURSDAY 18 0 0 0 0 0 0 School 0.808788 0.522697 0.1031 0.1115 0.9781 0.00 0.2069 0.1667 NaN NaN NaN 0.1050 0.1157 0.9782 0.0000 0.2069 0.1667 NaN NaN NaN 0.1041 0.1115 0.9781 0.00 0.2069 0.1667 NaN NaN NaN block_of_flats 0.0702 Stone,_brick No 0.0 0.0 0.0 0.0 -829.0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0.0 0.0 0.0 0.0 0.0 5.0 35 False 3.0 Consumer_credit 9.0 1.158400 0.091929 12.601088 0.079488 0.718507
307521 100074 Cash_loans F N Y 0 67500.0 675000.0 25447.5 675000.0 Unaccompanied Pensioner Secondary_/_secondary_special Married House_/_apartment 0.003122 -23670 NaN -7490.0 -4136 1 0 0 1 1 0 2.0 3 3 TUESDAY 11 0 0 0 0 0 0 NaN 0.660015 0.298595 0.0216 0.0545 0.9781 0.00 0.1034 0.0417 0.0095 0.0113 NaN 0.0221 0.0566 0.9782 0.0000 0.1034 0.0417 0.0097 0.0117 NaN 0.0219 0.0545 0.9781 0.00 0.1034 0.0417 0.0096 0.0115 NaN block_of_flats 0.0136 Stone,_brick No 0.0 0.0 0.0 0.0 -1671.0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0.0 0.0 0.0 0.0 0.0 0.0 65 True 3.0 Consumer_credit 2.0 10.000000 0.377000 26.525199 NaN NaN
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
356247 456170 Cash_loans F Y Y 0 157500.0 500490.0 48888.0 450000.0 Children Pensioner Secondary_/_secondary_special Single_/_not_married House_/_apartment 0.006671 -21780 NaN -10745.0 -5249 1 0 0 1 1 0 1.0 2 2 WEDNESDAY 11 0 0 0 0 0 0 NaN 0.471719 0.631355 0.0433 0.0527 0.9851 0.00 0.1034 0.1667 0.0368 0.0422 0.0024 0.0441 0.0547 0.9851 0.0000 0.1034 0.1667 0.0376 0.0440 0.0026 0.0437 0.0527 0.9851 0.00 0.1034 0.1667 0.0374 0.0430 0.0025 block_of_flats 0.0377 Panel No 0.0 0.0 0.0 0.0 -611.0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0.0 0.0 0.0 0.0 0.0 2.0 60 True 3.0 Consumer_credit 2.0 3.177714 0.310400 10.237482 NaN 0.851722
356248 456189 Cash_loans F N Y 0 270000.0 360000.0 28570.5 360000.0 Unaccompanied Commercial_associate Secondary_/_secondary_special Separated Rented_apartment 0.026392 -19397 -119.0 -4386.0 -2945 1 1 0 1 0 0 1.0 2 2 SUNDAY 12 0 0 0 0 0 0 Business_Entity_Type_3 0.689832 0.255332 0.1216 NaN 0.9935 0.12 0.1034 0.3750 NaN 0.1532 0.0000 0.1239 NaN 0.9935 0.1208 0.1034 0.3750 NaN 0.1596 0.0000 0.1228 NaN 0.9935 0.12 0.1034 0.3750 NaN 0.1559 0.0000 block_of_flats 0.1205 Stone,_brick No 3.0 0.0 3.0 0.0 -1252.0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0.0 0.0 0.0 0.0 0.0 3.0 53 False 6.0 Consumer_credit 11.0 1.333333 0.105817 12.600410 0.006135 0.442558
356249 456202 Cash_loans F Y N 3 135000.0 252022.5 23112.0 217561.5 Unaccompanied Working Secondary_/_secondary_special Civil_marriage House_/_apartment 0.009175 -11708 -369.0 -174.0 -4178 1 1 0 1 1 0 5.0 2 2 TUESDAY 16 0 0 0 0 0 0 Self-employed 0.762352 0.240541 0.0227 NaN 0.9786 0.00 NaN 0.0417 NaN 0.0171 NaN 0.0231 NaN 0.9786 0.0000 NaN 0.0417 NaN 0.0178 NaN 0.0229 NaN 0.9786 0.00 NaN 0.0417 NaN 0.0174 NaN NaN 0.0134 NaN No 0.0 0.0 0.0 0.0 -987.0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0.0 0.0 0.0 0.0 2.0 2.0 32 False 5.0 Credit_card 6.0 1.866833 0.171200 10.904400 0.031517 0.174671
356252 456223 Cash_loans F Y Y 1 202500.0 315000.0 33205.5 315000.0 Unaccompanied Commercial_associate Secondary_/_secondary_special Married House_/_apartment 0.026392 -15922 -3037.0 -2681.0 -1504 1 1 0 1 1 0 3.0 2 2 WEDNESDAY 12 0 0 0 0 0 0 Business_Entity_Type_3 0.632770 0.283712 0.1113 0.1364 0.9955 0.16 0.1379 0.3333 NaN 0.1383 0.0542 0.1134 0.1415 0.9955 0.1611 0.1379 0.3333 NaN 0.1441 0.0574 0.1124 0.1364 0.9955 0.16 0.1379 0.3333 NaN 0.1408 0.0554 block_of_flats 0.1663 Stone,_brick No 0.0 0.0 0.0 0.0 -838.0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0.0 0.0 0.0 0.0 3.0 1.0 44 False 5.0 Consumer_credit 2.0 1.555556 0.163978 9.486380 0.190742 0.733503
356253 456224 Cash_loans M N N 0 225000.0 450000.0 25128.0 450000.0 Family Commercial_associate Higher_education Married House_/_apartment 0.018850 -13968 -2731.0 -1461.0 -1364 1 1 1 1 1 0 2.0 2 2 MONDAY 10 0 1 1 0 1 1 Self-employed 0.445701 0.595456 0.1629 0.0723 0.9896 0.16 0.0690 0.6250 NaN 0.1563 0.1490 0.1660 0.0750 0.9896 0.1611 0.0690 0.6250 NaN 0.1204 0.1577 0.1645 0.0723 0.9896 0.16 0.0690 0.6250 NaN 0.1591 0.1521 block_of_flats 0.1974 Panel No 0.0 0.0 0.0 0.0 -2308.0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0.0 0.0 0.0 0.0 0.0 2.0 38 False 17.0 Consumer_credit 5.0 2.000000 0.111680 17.908309 0.195518 0.373090

25345 rows × 112 columns

Imputing missing data

In [ ]:
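# Imputation with pandas
def imputation_pandas(dataframe):
    # Work on a copy so the input DataFrame is left untouched.
    short_cleaned_impute = dataframe.copy()
    for col_name in dataframe:
        # Linear interpolation along the row order; ffill/bfill are a safety net
        # for any value the interpolation leaves missing.
        short_cleaned_impute[col_name] = (
            dataframe[col_name]
            .interpolate(method="linear", limit_direction="both")
            .ffill()
            .bfill()
        )
    return short_cleaned_impute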
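
To make the behaviour of the interpolation step concrete, here is a minimal sketch on a toy column (the values are invented for illustration and are not taken from the Home Credit data). It shows that an interior gap is replaced by a value between its two neighbouring rows, and that, with `limit_direction="both"`, gaps at the start and end of the column are also filled.

```python
import numpy as np
import pandas as pd

# Toy column (hypothetical values) with a gap at the start,
# one in the middle and one at the end.
toy = pd.Series([np.nan, 1.0, np.nan, 3.0, np.nan])

# Same call as in imputation_pandas, applied to a single column.
filled = toy.interpolate(method="linear", limit_direction="both")

print(filled.tolist())
```

Because the filled value is computed from the rows immediately above and below, the result depends on the row order of the DataFrame, which is worth keeping in mind when neighbouring rows are unrelated clients.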

TRAIN

In [ ]:
app_train_reduced.dtypes.unique()
Out[ ]:
array([dtype('int64'), dtype('float64'), dtype('O'), dtype('bool')],
      dtype=object)
In [ ]:
# DataFrame of the numeric features.
df_num_train = app_train_reduced.select_dtypes('number').reset_index(drop=True)

# DataFrame of the categorical features.
df_categ_train = app_train_reduced.select_dtypes('object').reset_index(drop=True)
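
The list of dtypes above includes `bool`, so it may help to check which columns each selector actually returns. The sketch below uses a small toy DataFrame (the `HAS_FLAG` column name is hypothetical) to illustrate that `select_dtypes('number')` and `select_dtypes('object')` both leave boolean columns out, so such a column would not appear in either sub-DataFrame.

```python
import pandas as pd

# Toy frame with one column per dtype family; HAS_FLAG is a hypothetical name.
toy = pd.DataFrame(
    {
        "AMT_CREDIT": [1000.0, 2500.0],
        "CNT_CHILDREN": [0, 2],
        "CODE_GENDER": pd.Series(["F", "M"], dtype=object),
        "HAS_FLAG": [True, False],
    }
)

num_cols = toy.select_dtypes("number").columns.tolist()
obj_cols = toy.select_dtypes("object").columns.tolist()

# Columns picked up by neither selector.
leftover = [c for c in toy.columns if c not in num_cols + obj_cols]

print(num_cols, obj_cols, leftover)
```

If the boolean column should be kept, one option would be to cast it to an integer type before the split; whether that is desirable here is a modelling choice.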

Imputation of the numeric columns by linear interpolation

In [ ]:
df_num_imputed_train = imputation_pandas(df_num_train)
df_num_imputed_train
Out[ ]:
SK_ID_CURR TARGET CNT_CHILDREN AMT_INCOME_TOTAL AMT_CREDIT AMT_ANNUITY AMT_GOODS_PRICE REGION_POPULATION_RELATIVE DAYS_BIRTH DAYS_EMPLOYED DAYS_REGISTRATION DAYS_ID_PUBLISH FLAG_MOBIL FLAG_EMP_PHONE FLAG_WORK_PHONE FLAG_CONT_MOBILE FLAG_PHONE FLAG_EMAIL CNT_FAM_MEMBERS REGION_RATING_CLIENT REGION_RATING_CLIENT_W_CITY HOUR_APPR_PROCESS_START REG_REGION_NOT_LIVE_REGION REG_REGION_NOT_WORK_REGION LIVE_REGION_NOT_WORK_REGION REG_CITY_NOT_LIVE_CITY REG_CITY_NOT_WORK_CITY LIVE_CITY_NOT_WORK_CITY EXT_SOURCE_2 EXT_SOURCE_3 APARTMENTS_AVG BASEMENTAREA_AVG YEARS_BEGINEXPLUATATION_AVG ELEVATORS_AVG ENTRANCES_AVG FLOORSMAX_AVG LANDAREA_AVG LIVINGAREA_AVG NONLIVINGAREA_AVG APARTMENTS_MODE BASEMENTAREA_MODE YEARS_BEGINEXPLUATATION_MODE ELEVATORS_MODE ENTRANCES_MODE FLOORSMAX_MODE LANDAREA_MODE LIVINGAREA_MODE NONLIVINGAREA_MODE APARTMENTS_MEDI BASEMENTAREA_MEDI YEARS_BEGINEXPLUATATION_MEDI ELEVATORS_MEDI ENTRANCES_MEDI FLOORSMAX_MEDI LANDAREA_MEDI LIVINGAREA_MEDI NONLIVINGAREA_MEDI TOTALAREA_MODE OBS_30_CNT_SOCIAL_CIRCLE DEF_30_CNT_SOCIAL_CIRCLE OBS_60_CNT_SOCIAL_CIRCLE DEF_60_CNT_SOCIAL_CIRCLE DAYS_LAST_PHONE_CHANGE FLAG_DOCUMENT_2 FLAG_DOCUMENT_3 FLAG_DOCUMENT_4 FLAG_DOCUMENT_5 FLAG_DOCUMENT_6 FLAG_DOCUMENT_7 FLAG_DOCUMENT_8 FLAG_DOCUMENT_9 FLAG_DOCUMENT_10 FLAG_DOCUMENT_11 FLAG_DOCUMENT_12 FLAG_DOCUMENT_13 FLAG_DOCUMENT_14 FLAG_DOCUMENT_15 FLAG_DOCUMENT_16 FLAG_DOCUMENT_17 FLAG_DOCUMENT_18 FLAG_DOCUMENT_19 FLAG_DOCUMENT_20 FLAG_DOCUMENT_21 AMT_REQ_CREDIT_BUREAU_HOUR AMT_REQ_CREDIT_BUREAU_DAY AMT_REQ_CREDIT_BUREAU_WEEK AMT_REQ_CREDIT_BUREAU_MON AMT_REQ_CREDIT_BUREAU_QRT AMT_REQ_CREDIT_BUREAU_YEAR AGE PREVIOUS_APPLICATION_COUNT PREVIOUS_LOANS_COUNT CREDIT_PERCENT_INCOME ANNUITY_CREDIT_PERCENT_INCOME CREDIT_REFUND_TIME DAYS_EMPLOYED_PERCENT EXT_SOURCE_1
0 100002 1.0 0 202500.0 406597.5 24700.5 351000.0 0.018801 9461 -637.0 -3648.0 -2120 1 1 0 1 1 0 1.0 2 2 10 0 0 0 0 0 0 0.262949 0.139376 0.0247 0.0369 0.9722 0.00 0.0690 0.0833 0.0369 0.0190 0.0000 0.0252 0.0383 0.9722 0.0000 0.0690 0.0833 0.0377 0.0198 0.0000 0.0250 0.0369 0.9722 0.00 0.0690 0.0833 0.0375 0.0193 0.0000 0.0149 2.0 2.0 2.0 2.0 -1134.0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0.000000 0.0 0.0 0.000000 0.0 1.000000 26 8.0 1.0 2.007889 0.121978 16.461104 -0.067329 0.083037
1 100003 0.0 0 270000.0 1293502.5 35698.5 1129500.0 0.003541 16765 -1188.0 -1186.0 -291 1 1 0 1 1 0 2.0 1 1 11 0 0 0 0 0 0 0.622246 0.158014 0.0959 0.0529 0.9851 0.08 0.0345 0.2917 0.0130 0.0549 0.0098 0.0924 0.0538 0.9851 0.0806 0.0345 0.2917 0.0128 0.0554 0.0000 0.0968 0.0529 0.9851 0.08 0.0345 0.2917 0.0132 0.0558 0.0100 0.0714 1.0 0.0 1.0 0.0 -828.0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0.000000 0.0 0.0 0.000000 0.0 0.000000 46 4.0 3.0 4.790750 0.132217 36.234085 -0.070862 0.311267
2 100016 0.0 0 67500.0 80865.0 5881.5 67500.0 0.031329 13439 -2717.0 -311.0 -3227 1 1 1 1 1 0 2.0 2 2 10 0 0 0 0 0 0 0.715042 0.176653 0.0825 0.0751 0.9811 0.00 0.2069 0.1667 0.0135 0.0778 0.0000 0.0840 0.0774 0.9811 0.0000 0.2069 0.1667 0.0138 0.0810 0.0000 0.0833 0.0751 0.9811 0.00 0.2069 0.1667 0.0137 0.0792 0.0000 0.0612 0.0 0.0 0.0 0.0 -2370.0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0.000000 0.0 0.0 1.000000 0.0 0.000000 37 7.0 4.0 1.198000 0.087133 13.749044 -0.202173 0.464831
3 100017 0.0 1 225000.0 918468.0 28966.5 697500.0 0.016612 14086 -3028.0 -643.0 -4911 1 1 0 1 0 0 3.0 2 2 13 0 0 0 0 0 0 0.566907 0.770087 0.1474 0.0973 0.9806 0.16 0.1379 0.3333 0.0931 0.1397 0.0000 0.1502 0.1010 0.9806 0.1611 0.1379 0.3333 0.0952 0.1456 0.0000 0.1489 0.0973 0.9806 0.16 0.1379 0.3333 0.0947 0.1422 0.0000 0.1417 0.0 0.0 0.0 0.0 -4.0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0.000000 0.0 0.0 0.000000 0.0 1.000000 39 6.0 2.0 4.082080 0.128740 31.707938 -0.214965 0.593385
4 100018 0.0 0 189000.0 773680.5 32778.0 679500.0 0.010006 14583 -203.0 -615.0 -2056 1 1 0 1 0 0 2.0 2 1 9 0 0 0 0 0 0 0.642656 0.663407 0.3495 0.1335 0.9985 0.40 0.1724 0.6667 0.1758 0.3774 0.1001 0.3561 0.1386 0.9985 0.4028 0.1724 0.6667 0.1798 0.3932 0.1060 0.3529 0.1335 0.9985 0.40 0.1724 0.6667 0.1789 0.3842 0.1022 0.3811 0.0 0.0 0.0 0.0 -188.0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0.000000 0.0 0.0 0.000000 0.0 0.500000 40 4.0 4.0 4.093548 0.173429 23.603652 -0.013920 0.721940
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
154802 456251 0.0 0 157500.0 254700.0 27558.0 225000.0 0.032561 9327 -236.0 -8456.0 -1982 1 1 0 1 0 0 1.0 1 1 15 0 0 0 0 0 0 0.681632 0.567741 0.2021 0.0887 0.9876 0.22 0.1034 0.6042 0.0594 0.1965 0.1095 0.1008 0.0172 0.9782 0.0806 0.0345 0.4583 0.0094 0.0853 0.0125 0.2040 0.0887 0.9876 0.22 0.1034 0.6042 0.0605 0.2001 0.1118 0.2898 0.0 0.0 0.0 0.0 -273.0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0.333333 0.0 0.0 1.666667 0.0 0.333333 26 10.0 1.0 1.617143 0.174971 9.242325 -0.025303 0.145570
154803 456252 0.0 0 72000.0 269550.0 12001.5 225000.0 0.025164 20775 -4078.5 -4388.0 -4090 1 0 0 1 1 0 1.0 2 2 8 0 0 0 0 0 0 0.115992 0.393300 0.0247 0.0435 0.9727 0.00 0.1034 0.0833 0.0579 0.0257 0.0000 0.0252 0.0451 0.9727 0.0000 0.1034 0.0833 0.0592 0.0267 0.0000 0.0250 0.0435 0.9727 0.00 0.1034 0.0833 0.0589 0.0261 0.0000 0.0214 0.0 0.0 0.0 0.0 0.0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0.666667 0.0 0.0 1.333333 0.0 0.666667 57 7.0 1.0 3.743750 0.166687 22.459693 -0.277285 0.444798
154804 456253 0.0 0 153000.0 677664.0 29979.0 585000.0 0.005002 14966 -7921.0 -6737.0 -5150 1 1 0 1 0 1 1.0 3 3 9 0 0 0 0 1 1 0.535722 0.218859 0.1031 0.0862 0.9816 0.00 0.2069 0.1667 0.0579 0.9279 0.0000 0.1050 0.0894 0.9816 0.0000 0.2069 0.1667 0.0592 0.9667 0.0000 0.1041 0.0862 0.9816 0.00 0.2069 0.1667 0.0589 0.9445 0.0000 0.7970 6.0 0.0 6.0 0.0 -1909.0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1.000000 0.0 0.0 1.000000 0.0 1.000000 41 4.0 2.0 4.429176 0.195941 22.604623 -0.529266 0.744026
154805 456254 1.0 0 171000.0 370107.0 20205.0 319500.0 0.005313 11961 -4786.0 -2562.0 -931 1 1 0 1 0 0 2.0 2 2 9 0 0 0 1 1 0 0.514163 0.661024 0.0124 0.0694 0.9771 0.04 0.0690 0.0417 0.0579 0.0061 0.0000 0.0126 0.0720 0.9772 0.0403 0.0690 0.0417 0.0592 0.0063 0.0000 0.0125 0.0694 0.9771 0.04 0.0690 0.0417 0.0589 0.0062 0.0000 0.0086 0.0 0.0 0.0 0.0 -322.0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0.000000 0.0 0.0 0.000000 0.0 0.000000 33 1.0 2.0 2.164368 0.118158 18.317595 -0.400134 0.739243
154806 456255 0.0 0 157500.0 675000.0 49117.5 675000.0 0.046220 16856 -1262.0 -5128.0 -410 1 1 1 1 1 0 2.0 1 1 20 0 0 0 0 1 1 0.708569 0.113922 0.0742 0.0526 0.9881 0.08 0.0690 0.3750 0.0579 0.0791 0.0000 0.0756 0.0546 0.9881 0.0806 0.0690 0.3750 0.0592 0.0824 0.0000 0.0749 0.0526 0.9881 0.08 0.0690 0.3750 0.0589 0.0805 0.0000 0.0718 0.0 0.0 0.0 0.0 -787.0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0.000000 0.0 0.0 2.000000 0.0 1.000000 46 11.0 8.0 4.285714 0.311857 13.742556 -0.074869 0.734460

154807 rows × 97 columns

Imputation of the categorical columns by their mode

In [ ]:
df_categ_train = df_categ_train.apply(lambda x:x.fillna(x.value_counts().index[0]))
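
As a quick illustration of the mode imputation, the following sketch applies the same `apply` / `fillna` pattern to a toy categorical DataFrame (the rows are made up; only the column names echo the dataset). `value_counts()` ignores missing values and sorts by decreasing frequency, so `index[0]` is the most frequent category of each column.

```python
import numpy as np
import pandas as pd

# Toy categorical frame with missing entries (hypothetical rows).
toy = pd.DataFrame(
    {
        "NAME_TYPE_SUITE": pd.Series(
            ["Family", np.nan, "Family", "Unaccompanied"], dtype=object
        ),
        "HOUSETYPE_MODE": pd.Series(
            ["block_of_flats", "block_of_flats", np.nan, np.nan], dtype=object
        ),
    }
)

# Same pattern as in the notebook: fill each column with its most frequent value.
filled = toy.apply(lambda col: col.fillna(col.value_counts().index[0]))

print(filled)
```

Note that `index[0]` raises an `IndexError` on a column that contains only missing values, so this pattern assumes every categorical column has at least one observed category.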
In [ ]:
app_train_final = pd.concat([df_num_imputed_train, df_categ_train], axis=1)
app_train_final
Out[ ]:
SK_ID_CURR TARGET CNT_CHILDREN AMT_INCOME_TOTAL AMT_CREDIT AMT_ANNUITY AMT_GOODS_PRICE REGION_POPULATION_RELATIVE DAYS_BIRTH DAYS_EMPLOYED DAYS_REGISTRATION DAYS_ID_PUBLISH FLAG_MOBIL FLAG_EMP_PHONE FLAG_WORK_PHONE FLAG_CONT_MOBILE FLAG_PHONE FLAG_EMAIL CNT_FAM_MEMBERS REGION_RATING_CLIENT REGION_RATING_CLIENT_W_CITY HOUR_APPR_PROCESS_START REG_REGION_NOT_LIVE_REGION REG_REGION_NOT_WORK_REGION LIVE_REGION_NOT_WORK_REGION REG_CITY_NOT_LIVE_CITY REG_CITY_NOT_WORK_CITY LIVE_CITY_NOT_WORK_CITY EXT_SOURCE_2 EXT_SOURCE_3 APARTMENTS_AVG BASEMENTAREA_AVG YEARS_BEGINEXPLUATATION_AVG ELEVATORS_AVG ENTRANCES_AVG FLOORSMAX_AVG LANDAREA_AVG LIVINGAREA_AVG NONLIVINGAREA_AVG APARTMENTS_MODE BASEMENTAREA_MODE YEARS_BEGINEXPLUATATION_MODE ELEVATORS_MODE ENTRANCES_MODE FLOORSMAX_MODE LANDAREA_MODE LIVINGAREA_MODE NONLIVINGAREA_MODE APARTMENTS_MEDI BASEMENTAREA_MEDI YEARS_BEGINEXPLUATATION_MEDI ELEVATORS_MEDI ENTRANCES_MEDI FLOORSMAX_MEDI LANDAREA_MEDI LIVINGAREA_MEDI NONLIVINGAREA_MEDI TOTALAREA_MODE OBS_30_CNT_SOCIAL_CIRCLE DEF_30_CNT_SOCIAL_CIRCLE OBS_60_CNT_SOCIAL_CIRCLE DEF_60_CNT_SOCIAL_CIRCLE DAYS_LAST_PHONE_CHANGE FLAG_DOCUMENT_2 FLAG_DOCUMENT_3 FLAG_DOCUMENT_4 FLAG_DOCUMENT_5 FLAG_DOCUMENT_6 FLAG_DOCUMENT_7 FLAG_DOCUMENT_8 FLAG_DOCUMENT_9 FLAG_DOCUMENT_10 FLAG_DOCUMENT_11 FLAG_DOCUMENT_12 FLAG_DOCUMENT_13 FLAG_DOCUMENT_14 FLAG_DOCUMENT_15 FLAG_DOCUMENT_16 FLAG_DOCUMENT_17 FLAG_DOCUMENT_18 FLAG_DOCUMENT_19 FLAG_DOCUMENT_20 FLAG_DOCUMENT_21 AMT_REQ_CREDIT_BUREAU_HOUR AMT_REQ_CREDIT_BUREAU_DAY AMT_REQ_CREDIT_BUREAU_WEEK AMT_REQ_CREDIT_BUREAU_MON AMT_REQ_CREDIT_BUREAU_QRT AMT_REQ_CREDIT_BUREAU_YEAR AGE PREVIOUS_APPLICATION_COUNT PREVIOUS_LOANS_COUNT CREDIT_PERCENT_INCOME ANNUITY_CREDIT_PERCENT_INCOME CREDIT_REFUND_TIME DAYS_EMPLOYED_PERCENT EXT_SOURCE_1 NAME_CONTRACT_TYPE CODE_GENDER FLAG_OWN_CAR FLAG_OWN_REALTY NAME_TYPE_SUITE NAME_INCOME_TYPE NAME_EDUCATION_TYPE NAME_FAMILY_STATUS NAME_HOUSING_TYPE WEEKDAY_APPR_PROCESS_START ORGANIZATION_TYPE HOUSETYPE_MODE WALLSMATERIAL_MODE EMERGENCYSTATE_MODE MOST_CREDIT_TYPE
0 100002 1.0 0 202500.0 406597.5 24700.5 351000.0 0.018801 9461 -637.0 -3648.0 -2120 1 1 0 1 1 0 1.0 2 2 10 0 0 0 0 0 0 0.262949 0.139376 0.0247 0.0369 0.9722 0.00 0.0690 0.0833 0.0369 0.0190 0.0000 0.0252 0.0383 0.9722 0.0000 0.0690 0.0833 0.0377 0.0198 0.0000 0.0250 0.0369 0.9722 0.00 0.0690 0.0833 0.0375 0.0193 0.0000 0.0149 2.0 2.0 2.0 2.0 -1134.0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0.000000 0.0 0.0 0.000000 0.0 1.000000 26 8.0 1.0 2.007889 0.121978 16.461104 -0.067329 0.083037 Cash_loans M N Y Unaccompanied Working Secondary_/_secondary_special Single_/_not_married House_/_apartment WEDNESDAY Business_Entity_Type_3 block_of_flats Stone,_brick No Consumer_credit
1 100003 0.0 0 270000.0 1293502.5 35698.5 1129500.0 0.003541 16765 -1188.0 -1186.0 -291 1 1 0 1 1 0 2.0 1 1 11 0 0 0 0 0 0 0.622246 0.158014 0.0959 0.0529 0.9851 0.08 0.0345 0.2917 0.0130 0.0549 0.0098 0.0924 0.0538 0.9851 0.0806 0.0345 0.2917 0.0128 0.0554 0.0000 0.0968 0.0529 0.9851 0.08 0.0345 0.2917 0.0132 0.0558 0.0100 0.0714 1.0 0.0 1.0 0.0 -828.0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0.000000 0.0 0.0 0.000000 0.0 0.000000 46 4.0 3.0 4.790750 0.132217 36.234085 -0.070862 0.311267 Cash_loans F N N Family State_servant Higher_education Married House_/_apartment MONDAY School block_of_flats Block No Credit_card
2 100016 0.0 0 67500.0 80865.0 5881.5 67500.0 0.031329 13439 -2717.0 -311.0 -3227 1 1 1 1 1 0 2.0 2 2 10 0 0 0 0 0 0 0.715042 0.176653 0.0825 0.0751 0.9811 0.00 0.2069 0.1667 0.0135 0.0778 0.0000 0.0840 0.0774 0.9811 0.0000 0.2069 0.1667 0.0138 0.0810 0.0000 0.0833 0.0751 0.9811 0.00 0.2069 0.1667 0.0137 0.0792 0.0000 0.0612 0.0 0.0 0.0 0.0 -2370.0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0.000000 0.0 0.0 1.000000 0.0 0.000000 37 7.0 4.0 1.198000 0.087133 13.749044 -0.202173 0.464831 Cash_loans F N Y Unaccompanied Working Secondary_/_secondary_special Married House_/_apartment FRIDAY Business_Entity_Type_2 block_of_flats Panel No Consumer_credit
3 100017 0.0 1 225000.0 918468.0 28966.5 697500.0 0.016612 14086 -3028.0 -643.0 -4911 1 1 0 1 0 0 3.0 2 2 13 0 0 0 0 0 0 0.566907 0.770087 0.1474 0.0973 0.9806 0.16 0.1379 0.3333 0.0931 0.1397 0.0000 0.1502 0.1010 0.9806 0.1611 0.1379 0.3333 0.0952 0.1456 0.0000 0.1489 0.0973 0.9806 0.16 0.1379 0.3333 0.0947 0.1422 0.0000 0.1417 0.0 0.0 0.0 0.0 -4.0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0.000000 0.0 0.0 0.000000 0.0 1.000000 39 6.0 2.0 4.082080 0.128740 31.707938 -0.214965 0.593385 Cash_loans M Y N Unaccompanied Working Secondary_/_secondary_special Married House_/_apartment THURSDAY Self-employed block_of_flats Panel No Consumer_credit
4 100018 0.0 0 189000.0 773680.5 32778.0 679500.0 0.010006 14583 -203.0 -615.0 -2056 1 1 0 1 0 0 2.0 2 1 9 0 0 0 0 0 0 0.642656 0.663407 0.3495 0.1335 0.9985 0.40 0.1724 0.6667 0.1758 0.3774 0.1001 0.3561 0.1386 0.9985 0.4028 0.1724 0.6667 0.1798 0.3932 0.1060 0.3529 0.1335 0.9985 0.40 0.1724 0.6667 0.1789 0.3842 0.1022 0.3811 0.0 0.0 0.0 0.0 -188.0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0.000000 0.0 0.0 0.000000 0.0 0.500000 40 4.0 4.0 4.093548 0.173429 23.603652 -0.013920 0.721940 Cash_loans F N Y Unaccompanied Working Secondary_/_secondary_special Married House_/_apartment MONDAY Transport:_type_2 block_of_flats Panel No Consumer_credit
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
154802 456251 0.0 0 157500.0 254700.0 27558.0 225000.0 0.032561 9327 -236.0 -8456.0 -1982 1 1 0 1 0 0 1.0 1 1 15 0 0 0 0 0 0 0.681632 0.567741 0.2021 0.0887 0.9876 0.22 0.1034 0.6042 0.0594 0.1965 0.1095 0.1008 0.0172 0.9782 0.0806 0.0345 0.4583 0.0094 0.0853 0.0125 0.2040 0.0887 0.9876 0.22 0.1034 0.6042 0.0605 0.2001 0.1118 0.2898 0.0 0.0 0.0 0.0 -273.0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0.333333 0.0 0.0 1.666667 0.0 0.333333 26 10.0 1.0 1.617143 0.174971 9.242325 -0.025303 0.145570 Cash_loans M N N Unaccompanied Working Secondary_/_secondary_special Separated With_parents THURSDAY Services block_of_flats Stone,_brick No Consumer_credit
154803 456252 0.0 0 72000.0 269550.0 12001.5 225000.0 0.025164 20775 -4078.5 -4388.0 -4090 1 0 0 1 1 0 1.0 2 2 8 0 0 0 0 0 0 0.115992 0.393300 0.0247 0.0435 0.9727 0.00 0.1034 0.0833 0.0579 0.0257 0.0000 0.0252 0.0451 0.9727 0.0000 0.1034 0.0833 0.0592 0.0267 0.0000 0.0250 0.0435 0.9727 0.00 0.1034 0.0833 0.0589 0.0261 0.0000 0.0214 0.0 0.0 0.0 0.0 0.0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0.666667 0.0 0.0 1.333333 0.0 0.666667 57 7.0 1.0 3.743750 0.166687 22.459693 -0.277285 0.444798 Cash_loans F N Y Unaccompanied Pensioner Secondary_/_secondary_special Widow House_/_apartment MONDAY Business_Entity_Type_3 block_of_flats Stone,_brick No Consumer_credit
154804 456253 0.0 0 153000.0 677664.0 29979.0 585000.0 0.005002 14966 -7921.0 -6737.0 -5150 1 1 0 1 0 1 1.0 3 3 9 0 0 0 0 1 1 0.535722 0.218859 0.1031 0.0862 0.9816 0.00 0.2069 0.1667 0.0579 0.9279 0.0000 0.1050 0.0894 0.9816 0.0000 0.2069 0.1667 0.0592 0.9667 0.0000 0.1041 0.0862 0.9816 0.00 0.2069 0.1667 0.0589 0.9445 0.0000 0.7970 6.0 0.0 6.0 0.0 -1909.0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1.000000 0.0 0.0 1.000000 0.0 1.000000 41 4.0 2.0 4.429176 0.195941 22.604623 -0.529266 0.744026 Cash_loans F N Y Unaccompanied Working Higher_education Separated House_/_apartment THURSDAY School block_of_flats Panel No Consumer_credit
154805 456254 1.0 0 171000.0 370107.0 20205.0 319500.0 0.005313 11961 -4786.0 -2562.0 -931 1 1 0 1 0 0 2.0 2 2 9 0 0 0 1 1 0 0.514163 0.661024 0.0124 0.0694 0.9771 0.04 0.0690 0.0417 0.0579 0.0061 0.0000 0.0126 0.0720 0.9772 0.0403 0.0690 0.0417 0.0592 0.0063 0.0000 0.0125 0.0694 0.9771 0.04 0.0690 0.0417 0.0589 0.0062 0.0000 0.0086 0.0 0.0 0.0 0.0 -322.0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0.000000 0.0 0.0 0.000000 0.0 0.000000 33 1.0 2.0 2.164368 0.118158 18.317595 -0.400134 0.739243 Cash_loans F N Y Unaccompanied Commercial_associate Secondary_/_secondary_special Married House_/_apartment WEDNESDAY Business_Entity_Type_1 block_of_flats Stone,_brick No Consumer_credit
154806 456255 0.0 0 157500.0 675000.0 49117.5 675000.0 0.046220 16856 -1262.0 -5128.0 -410 1 1 1 1 1 0 2.0 1 1 20 0 0 0 0 1 1 0.708569 0.113922 0.0742 0.0526 0.9881 0.08 0.0690 0.3750 0.0579 0.0791 0.0000 0.0756 0.0546 0.9881 0.0806 0.0690 0.3750 0.0592 0.0824 0.0000 0.0749 0.0526 0.9881 0.08 0.0690 0.3750 0.0589 0.0805 0.0000 0.0718 0.0 0.0 0.0 0.0 -787.0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0.000000 0.0 0.0 2.000000 0.0 1.000000 46 11.0 8.0 4.285714 0.311857 13.742556 -0.074869 0.734460 Cash_loans F N N Unaccompanied Commercial_associate Higher_education Married House_/_apartment THURSDAY Business_Entity_Type_3 block_of_flats Panel No Consumer_credit

154807 rows × 112 columns

In [ ]:
app_train_final = app_train_final.reset_index(drop=True)

TEST

In [ ]:
# DataFrame of the numeric features.
df_num_test = app_test_reduced.select_dtypes('number').reset_index(drop=True)

# DataFrame of the categorical features.
df_categ_test = app_test_reduced.select_dtypes('object').reset_index(drop=True)
In [ ]:
df_num_imputed_test = imputation_pandas(df_num_test)
df_num_imputed_test
Out[ ]:
SK_ID_CURR CNT_CHILDREN AMT_INCOME_TOTAL AMT_CREDIT AMT_ANNUITY AMT_GOODS_PRICE REGION_POPULATION_RELATIVE DAYS_BIRTH DAYS_EMPLOYED DAYS_REGISTRATION DAYS_ID_PUBLISH FLAG_MOBIL FLAG_EMP_PHONE FLAG_WORK_PHONE FLAG_CONT_MOBILE FLAG_PHONE FLAG_EMAIL CNT_FAM_MEMBERS REGION_RATING_CLIENT REGION_RATING_CLIENT_W_CITY HOUR_APPR_PROCESS_START REG_REGION_NOT_LIVE_REGION REG_REGION_NOT_WORK_REGION LIVE_REGION_NOT_WORK_REGION REG_CITY_NOT_LIVE_CITY REG_CITY_NOT_WORK_CITY LIVE_CITY_NOT_WORK_CITY EXT_SOURCE_2 EXT_SOURCE_3 APARTMENTS_AVG BASEMENTAREA_AVG YEARS_BEGINEXPLUATATION_AVG ELEVATORS_AVG ENTRANCES_AVG FLOORSMAX_AVG LANDAREA_AVG LIVINGAREA_AVG NONLIVINGAREA_AVG APARTMENTS_MODE BASEMENTAREA_MODE YEARS_BEGINEXPLUATATION_MODE ELEVATORS_MODE ENTRANCES_MODE FLOORSMAX_MODE LANDAREA_MODE LIVINGAREA_MODE NONLIVINGAREA_MODE APARTMENTS_MEDI BASEMENTAREA_MEDI YEARS_BEGINEXPLUATATION_MEDI ELEVATORS_MEDI ENTRANCES_MEDI FLOORSMAX_MEDI LANDAREA_MEDI LIVINGAREA_MEDI NONLIVINGAREA_MEDI TOTALAREA_MODE OBS_30_CNT_SOCIAL_CIRCLE DEF_30_CNT_SOCIAL_CIRCLE OBS_60_CNT_SOCIAL_CIRCLE DEF_60_CNT_SOCIAL_CIRCLE DAYS_LAST_PHONE_CHANGE FLAG_DOCUMENT_2 FLAG_DOCUMENT_3 FLAG_DOCUMENT_4 FLAG_DOCUMENT_5 FLAG_DOCUMENT_6 FLAG_DOCUMENT_7 FLAG_DOCUMENT_8 FLAG_DOCUMENT_9 FLAG_DOCUMENT_10 FLAG_DOCUMENT_11 FLAG_DOCUMENT_12 FLAG_DOCUMENT_13 FLAG_DOCUMENT_14 FLAG_DOCUMENT_15 FLAG_DOCUMENT_16 FLAG_DOCUMENT_17 FLAG_DOCUMENT_18 FLAG_DOCUMENT_19 FLAG_DOCUMENT_20 FLAG_DOCUMENT_21 AMT_REQ_CREDIT_BUREAU_HOUR AMT_REQ_CREDIT_BUREAU_DAY AMT_REQ_CREDIT_BUREAU_WEEK AMT_REQ_CREDIT_BUREAU_MON AMT_REQ_CREDIT_BUREAU_QRT AMT_REQ_CREDIT_BUREAU_YEAR AGE PREVIOUS_APPLICATION_COUNT PREVIOUS_LOANS_COUNT CREDIT_PERCENT_INCOME ANNUITY_CREDIT_PERCENT_INCOME CREDIT_REFUND_TIME DAYS_EMPLOYED_PERCENT EXT_SOURCE_1
0 100001 0 135000.0 568800.0 20560.5 450000.0 0.018850 -19241 -2329.000000 -5170.0 -812 1 1 0 1 0 1 2.0 2 2 18 0 0 0 0 0 0 0.789654 0.159520 0.0660 0.0590 0.9732 0.32 0.13790 0.1250 0.2042 0.05050 0.080000 0.0672 0.061200 0.9732 0.3222 0.13790 0.1250 0.2089 0.0526 0.084700 0.0666 0.0590 0.9732 0.32 0.13790 0.1250 0.2078 0.05140 0.081700 0.0392 0.0 0.0 0.0 0.0 -1740.0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0.0 0.0 0.0 0.0 0.0 0.0 53 7.0 1.0 4.213333 0.152300 27.664697 0.121044 0.752614
1 100028 2 315000.0 1575000.0 49018.5 1575000.0 0.026392 -13976 -1866.000000 -2000.0 -4208 1 1 0 1 1 0 4.0 2 2 11 0 0 0 0 0 0 0.509677 0.612704 0.3052 0.1974 0.9970 0.32 0.27590 0.3750 0.2042 0.36730 0.080000 0.3109 0.204900 0.9970 0.3222 0.27590 0.3750 0.2089 0.3827 0.084700 0.3081 0.1974 0.9970 0.32 0.27590 0.3750 0.2078 0.37390 0.081700 0.3700 0.0 0.0 0.0 0.0 -1805.0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0.0 0.0 0.0 0.0 0.0 3.0 38 12.0 5.0 5.000000 0.155614 32.130726 0.133515 0.525734
2 100042 0 270000.0 959688.0 34600.5 810000.0 0.025164 -18604 -12009.000000 -6116.0 -2027 1 1 0 1 1 0 2.0 2 2 15 0 0 0 0 0 0 0.628904 0.392774 0.2412 0.0084 0.9821 0.16 0.13790 0.3333 0.1683 0.22180 0.073100 0.2458 0.008800 0.9821 0.1611 0.13790 0.3333 0.1721 0.2311 0.077400 0.2436 0.0084 0.9821 0.16 0.13790 0.3333 0.1712 0.22580 0.074600 0.2151 0.0 0.0 0.0 0.0 -1705.0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0.0 0.0 0.0 0.0 1.0 2.0 51 14.0 9.0 3.554400 0.128150 27.736247 0.645506 0.622121
3 100066 0 315000.0 364896.0 28957.5 315000.0 0.046220 -12744 -1013.000000 -1686.0 -3171 1 1 0 1 0 0 2.0 1 1 18 0 0 0 0 0 0 0.808788 0.522697 0.1031 0.1115 0.9781 0.00 0.20690 0.1667 0.0889 0.11655 0.048967 0.1050 0.115700 0.9782 0.0000 0.20690 0.1667 0.0909 0.1214 0.051867 0.1041 0.1115 0.9781 0.00 0.20690 0.1667 0.0904 0.11865 0.049967 0.0702 0.0 0.0 0.0 0.0 -829.0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0.0 0.0 0.0 0.0 0.0 5.0 35 3.0 9.0 1.158400 0.091929 12.601088 0.079488 0.718507
4 100074 0 67500.0 675000.0 25447.5 675000.0 0.003122 -23670 -1010.000000 -7490.0 -4136 1 0 0 1 1 0 2.0 3 3 11 0 0 0 0 0 0 0.660015 0.298595 0.0216 0.0545 0.9781 0.00 0.10340 0.0417 0.0095 0.01130 0.024833 0.0221 0.056600 0.9782 0.0000 0.10340 0.0417 0.0097 0.0117 0.026333 0.0219 0.0545 0.9781 0.00 0.10340 0.0417 0.0096 0.01150 0.025333 0.0136 0.0 0.0 0.0 0.0 -1671.0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0.0 0.0 0.0 0.0 0.0 0.0 65 3.0 2.0 10.000000 0.377000 26.525199 0.076867 0.531336
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
25340 456170 0 157500.0 500490.0 48888.0 450000.0 0.006671 -21780 -1307.333333 -10745.0 -5249 1 0 0 1 1 0 1.0 2 2 11 0 0 0 0 0 0 0.471719 0.631355 0.0433 0.0527 0.9851 0.00 0.10340 0.1667 0.0368 0.04220 0.002400 0.0441 0.054700 0.9851 0.0000 0.10340 0.1667 0.0376 0.0440 0.002600 0.0437 0.0527 0.9851 0.00 0.10340 0.1667 0.0374 0.04300 0.002500 0.0377 0.0 0.0 0.0 0.0 -611.0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0.0 0.0 0.0 0.0 0.0 2.0 60 3.0 2.0 3.177714 0.310400 10.237482 0.065481 0.851722
25341 456189 0 270000.0 360000.0 28570.5 360000.0 0.026392 -19397 -119.000000 -4386.0 -2945 1 1 0 1 0 0 1.0 2 2 12 0 0 0 0 0 0 0.689832 0.255332 0.1216 0.0806 0.9935 0.12 0.10340 0.3750 0.0368 0.15320 0.000000 0.1239 0.083633 0.9935 0.1208 0.10340 0.3750 0.0376 0.1596 0.000000 0.1228 0.0806 0.9935 0.12 0.10340 0.3750 0.0374 0.15590 0.000000 0.1205 3.0 0.0 3.0 0.0 -1252.0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0.0 0.0 0.0 0.0 0.0 3.0 53 6.0 11.0 1.333333 0.105817 12.600410 0.006135 0.442558
25342 456202 3 135000.0 252022.5 23112.0 217561.5 0.009175 -11708 -369.000000 -174.0 -4178 1 1 0 1 1 0 5.0 2 2 16 0 0 0 0 0 0 0.762352 0.240541 0.0227 0.1085 0.9786 0.00 0.12065 0.0417 0.0368 0.01710 0.027100 0.0231 0.112567 0.9786 0.0000 0.12065 0.0417 0.0376 0.0178 0.028700 0.0229 0.1085 0.9786 0.00 0.12065 0.0417 0.0374 0.01740 0.027700 0.0134 0.0 0.0 0.0 0.0 -987.0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0.0 0.0 0.0 0.0 2.0 2.0 32 5.0 6.0 1.866833 0.171200 10.904400 0.031517 0.174671
25343 456223 1 202500.0 315000.0 33205.5 315000.0 0.026392 -15922 -3037.000000 -2681.0 -1504 1 1 0 1 1 0 3.0 2 2 12 0 0 0 0 0 0 0.632770 0.283712 0.1113 0.1364 0.9955 0.16 0.13790 0.3333 0.0368 0.13830 0.054200 0.1134 0.141500 0.9955 0.1611 0.13790 0.3333 0.0376 0.1441 0.057400 0.1124 0.1364 0.9955 0.16 0.13790 0.3333 0.0374 0.14080 0.055400 0.1663 0.0 0.0 0.0 0.0 -838.0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0.0 0.0 0.0 0.0 3.0 1.0 44 5.0 2.0 1.555556 0.163978 9.486380 0.190742 0.733503
25344 456224 0 225000.0 450000.0 25128.0 450000.0 0.018850 -13968 -2731.000000 -1461.0 -1364 1 1 1 1 1 0 2.0 2 2 10 0 1 1 0 1 1 0.445701 0.595456 0.1629 0.0723 0.9896 0.16 0.06900 0.6250 0.0368 0.15630 0.149000 0.1660 0.075000 0.9896 0.1611 0.06900 0.6250 0.0376 0.1204 0.157700 0.1645 0.0723 0.9896 0.16 0.06900 0.6250 0.0374 0.15910 0.152100 0.1974 0.0 0.0 0.0 0.0 -2308.0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0.0 0.0 0.0 0.0 0.0 2.0 38 17.0 5.0 2.000000 0.111680 17.908309 0.195518 0.373090

25345 rows × 96 columns

In [ ]:
df_categ_test = df_categ_test.apply(lambda x:x.fillna(x.value_counts().index[0]))
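
The cell above computes the fill values from the test set itself. A possible variant, shown here only as a sketch and not as what this notebook does, is to compute the modes once on the training data and reuse them for the test data, so that both sets are filled with the same categories. The toy frames `train_cat` and `test_cat` and their contents are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-ins for df_categ_train and df_categ_test.
train_cat = pd.DataFrame(
    {"ORGANIZATION_TYPE": pd.Series(["School", "School", "Police", np.nan], dtype=object)}
)
test_cat = pd.DataFrame(
    {"ORGANIZATION_TYPE": pd.Series(["Police", np.nan, "Police"], dtype=object)}
)

# Most frequent value of each training column (Series indexed by column name).
train_modes = train_cat.mode().iloc[0]

# Fill the test columns with the modes learned on the training data.
test_filled = test_cat.fillna(train_modes)

print(test_filled["ORGANIZATION_TYPE"].tolist())
```

In this toy case the test set's own mode would be `Police`, whereas the training mode is `School`, which shows how the two approaches can differ.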
In [ ]:
app_test_final = pd.concat([df_num_imputed_test, df_categ_test], axis=1)
app_test_final
Out[ ]:
SK_ID_CURR CNT_CHILDREN AMT_INCOME_TOTAL AMT_CREDIT AMT_ANNUITY AMT_GOODS_PRICE REGION_POPULATION_RELATIVE DAYS_BIRTH DAYS_EMPLOYED DAYS_REGISTRATION DAYS_ID_PUBLISH FLAG_MOBIL FLAG_EMP_PHONE FLAG_WORK_PHONE FLAG_CONT_MOBILE FLAG_PHONE FLAG_EMAIL CNT_FAM_MEMBERS REGION_RATING_CLIENT REGION_RATING_CLIENT_W_CITY HOUR_APPR_PROCESS_START REG_REGION_NOT_LIVE_REGION REG_REGION_NOT_WORK_REGION LIVE_REGION_NOT_WORK_REGION REG_CITY_NOT_LIVE_CITY REG_CITY_NOT_WORK_CITY LIVE_CITY_NOT_WORK_CITY EXT_SOURCE_2 EXT_SOURCE_3 APARTMENTS_AVG BASEMENTAREA_AVG YEARS_BEGINEXPLUATATION_AVG ELEVATORS_AVG ENTRANCES_AVG FLOORSMAX_AVG LANDAREA_AVG LIVINGAREA_AVG NONLIVINGAREA_AVG APARTMENTS_MODE BASEMENTAREA_MODE YEARS_BEGINEXPLUATATION_MODE ELEVATORS_MODE ENTRANCES_MODE FLOORSMAX_MODE LANDAREA_MODE LIVINGAREA_MODE NONLIVINGAREA_MODE APARTMENTS_MEDI BASEMENTAREA_MEDI YEARS_BEGINEXPLUATATION_MEDI ELEVATORS_MEDI ENTRANCES_MEDI FLOORSMAX_MEDI LANDAREA_MEDI LIVINGAREA_MEDI NONLIVINGAREA_MEDI TOTALAREA_MODE OBS_30_CNT_SOCIAL_CIRCLE DEF_30_CNT_SOCIAL_CIRCLE OBS_60_CNT_SOCIAL_CIRCLE DEF_60_CNT_SOCIAL_CIRCLE DAYS_LAST_PHONE_CHANGE FLAG_DOCUMENT_2 FLAG_DOCUMENT_3 FLAG_DOCUMENT_4 FLAG_DOCUMENT_5 FLAG_DOCUMENT_6 FLAG_DOCUMENT_7 FLAG_DOCUMENT_8 FLAG_DOCUMENT_9 FLAG_DOCUMENT_10 FLAG_DOCUMENT_11 FLAG_DOCUMENT_12 FLAG_DOCUMENT_13 FLAG_DOCUMENT_14 FLAG_DOCUMENT_15 FLAG_DOCUMENT_16 FLAG_DOCUMENT_17 FLAG_DOCUMENT_18 FLAG_DOCUMENT_19 FLAG_DOCUMENT_20 FLAG_DOCUMENT_21 AMT_REQ_CREDIT_BUREAU_HOUR AMT_REQ_CREDIT_BUREAU_DAY AMT_REQ_CREDIT_BUREAU_WEEK AMT_REQ_CREDIT_BUREAU_MON AMT_REQ_CREDIT_BUREAU_QRT AMT_REQ_CREDIT_BUREAU_YEAR AGE PREVIOUS_APPLICATION_COUNT PREVIOUS_LOANS_COUNT CREDIT_PERCENT_INCOME ANNUITY_CREDIT_PERCENT_INCOME CREDIT_REFUND_TIME DAYS_EMPLOYED_PERCENT EXT_SOURCE_1 NAME_CONTRACT_TYPE CODE_GENDER FLAG_OWN_CAR FLAG_OWN_REALTY NAME_TYPE_SUITE NAME_INCOME_TYPE NAME_EDUCATION_TYPE NAME_FAMILY_STATUS NAME_HOUSING_TYPE WEEKDAY_APPR_PROCESS_START ORGANIZATION_TYPE HOUSETYPE_MODE WALLSMATERIAL_MODE EMERGENCYSTATE_MODE MOST_CREDIT_TYPE
0 100001 0 135000.0 568800.0 20560.5 450000.0 0.018850 -19241 -2329.000000 -5170.0 -812 1 1 0 1 0 1 2.0 2 2 18 0 0 0 0 0 0 0.789654 0.159520 0.0660 0.0590 0.9732 0.32 0.13790 0.1250 0.2042 0.05050 0.080000 0.0672 0.061200 0.9732 0.3222 0.13790 0.1250 0.2089 0.0526 0.084700 0.0666 0.0590 0.9732 0.32 0.13790 0.1250 0.2078 0.05140 0.081700 0.0392 0.0 0.0 0.0 0.0 -1740.0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0.0 0.0 0.0 0.0 0.0 0.0 53 7.0 1.0 4.213333 0.152300 27.664697 0.121044 0.752614 Cash_loans F N Y Unaccompanied Working Higher_education Married House_/_apartment TUESDAY Kindergarten block_of_flats Stone,_brick No Consumer_credit
1 100028 2 315000.0 1575000.0 49018.5 1575000.0 0.026392 -13976 -1866.000000 -2000.0 -4208 1 1 0 1 1 0 4.0 2 2 11 0 0 0 0 0 0 0.509677 0.612704 0.3052 0.1974 0.9970 0.32 0.27590 0.3750 0.2042 0.36730 0.080000 0.3109 0.204900 0.9970 0.3222 0.27590 0.3750 0.2089 0.3827 0.084700 0.3081 0.1974 0.9970 0.32 0.27590 0.3750 0.2078 0.37390 0.081700 0.3700 0.0 0.0 0.0 0.0 -1805.0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0.0 0.0 0.0 0.0 0.0 3.0 38 12.0 5.0 5.000000 0.155614 32.130726 0.133515 0.525734 Cash_loans F N Y Unaccompanied Working Secondary_/_secondary_special Married House_/_apartment WEDNESDAY Business_Entity_Type_3 block_of_flats Panel No Consumer_credit
2 100042 0 270000.0 959688.0 34600.5 810000.0 0.025164 -18604 -12009.000000 -6116.0 -2027 1 1 0 1 1 0 2.0 2 2 15 0 0 0 0 0 0 0.628904 0.392774 0.2412 0.0084 0.9821 0.16 0.13790 0.3333 0.1683 0.22180 0.073100 0.2458 0.008800 0.9821 0.1611 0.13790 0.3333 0.1721 0.2311 0.077400 0.2436 0.0084 0.9821 0.16 0.13790 0.3333 0.1712 0.22580 0.074600 0.2151 0.0 0.0 0.0 0.0 -1705.0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0.0 0.0 0.0 0.0 1.0 2.0 51 14.0 9.0 3.554400 0.128150 27.736247 0.645506 0.622121 Cash_loans F Y Y Unaccompanied State_servant Secondary_/_secondary_special Married House_/_apartment MONDAY Government block_of_flats Block No Consumer_credit
3 100066 0 315000.0 364896.0 28957.5 315000.0 0.046220 -12744 -1013.000000 -1686.0 -3171 1 1 0 1 0 0 2.0 1 1 18 0 0 0 0 0 0 0.808788 0.522697 0.1031 0.1115 0.9781 0.00 0.20690 0.1667 0.0889 0.11655 0.048967 0.1050 0.115700 0.9782 0.0000 0.20690 0.1667 0.0909 0.1214 0.051867 0.1041 0.1115 0.9781 0.00 0.20690 0.1667 0.0904 0.11865 0.049967 0.0702 0.0 0.0 0.0 0.0 -829.0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0.0 0.0 0.0 0.0 0.0 5.0 35 3.0 9.0 1.158400 0.091929 12.601088 0.079488 0.718507 Cash_loans F N Y Unaccompanied State_servant Higher_education Married House_/_apartment THURSDAY School block_of_flats Stone,_brick No Consumer_credit
4 100074 0 67500.0 675000.0 25447.5 675000.0 0.003122 -23670 -1010.000000 -7490.0 -4136 1 0 0 1 1 0 2.0 3 3 11 0 0 0 0 0 0 0.660015 0.298595 0.0216 0.0545 0.9781 0.00 0.10340 0.0417 0.0095 0.01130 0.024833 0.0221 0.056600 0.9782 0.0000 0.10340 0.0417 0.0097 0.0117 0.026333 0.0219 0.0545 0.9781 0.00 0.10340 0.0417 0.0096 0.01150 0.025333 0.0136 0.0 0.0 0.0 0.0 -1671.0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0.0 0.0 0.0 0.0 0.0 0.0 65 3.0 2.0 10.000000 0.377000 26.525199 0.076867 0.531336 Cash_loans F N Y Unaccompanied Pensioner Secondary_/_secondary_special Married House_/_apartment TUESDAY Business_Entity_Type_3 block_of_flats Stone,_brick No Consumer_credit
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
25340 456170 0 157500.0 500490.0 48888.0 450000.0 0.006671 -21780 -1307.333333 -10745.0 -5249 1 0 0 1 1 0 1.0 2 2 11 0 0 0 0 0 0 0.471719 0.631355 0.0433 0.0527 0.9851 0.00 0.10340 0.1667 0.0368 0.04220 0.002400 0.0441 0.054700 0.9851 0.0000 0.10340 0.1667 0.0376 0.0440 0.002600 0.0437 0.0527 0.9851 0.00 0.10340 0.1667 0.0374 0.04300 0.002500 0.0377 0.0 0.0 0.0 0.0 -611.0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0.0 0.0 0.0 0.0 0.0 2.0 60 3.0 2.0 3.177714 0.310400 10.237482 0.065481 0.851722 Cash_loans F Y Y Children Pensioner Secondary_/_secondary_special Single_/_not_married House_/_apartment WEDNESDAY Business_Entity_Type_3 block_of_flats Panel No Consumer_credit
25341 456189 0 270000.0 360000.0 28570.5 360000.0 0.026392 -19397 -119.000000 -4386.0 -2945 1 1 0 1 0 0 1.0 2 2 12 0 0 0 0 0 0 0.689832 0.255332 0.1216 0.0806 0.9935 0.12 0.10340 0.3750 0.0368 0.15320 0.000000 0.1239 0.083633 0.9935 0.1208 0.10340 0.3750 0.0376 0.1596 0.000000 0.1228 0.0806 0.9935 0.12 0.10340 0.3750 0.0374 0.15590 0.000000 0.1205 3.0 0.0 3.0 0.0 -1252.0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0.0 0.0 0.0 0.0 0.0 3.0 53 6.0 11.0 1.333333 0.105817 12.600410 0.006135 0.442558 Cash_loans F N Y Unaccompanied Commercial_associate Secondary_/_secondary_special Separated Rented_apartment SUNDAY Business_Entity_Type_3 block_of_flats Stone,_brick No Consumer_credit
25342 456202 3 135000.0 252022.5 23112.0 217561.5 0.009175 -11708 -369.000000 -174.0 -4178 1 1 0 1 1 0 5.0 2 2 16 0 0 0 0 0 0 0.762352 0.240541 0.0227 0.1085 0.9786 0.00 0.12065 0.0417 0.0368 0.01710 0.027100 0.0231 0.112567 0.9786 0.0000 0.12065 0.0417 0.0376 0.0178 0.028700 0.0229 0.1085 0.9786 0.00 0.12065 0.0417 0.0374 0.01740 0.027700 0.0134 0.0 0.0 0.0 0.0 -987.0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0.0 0.0 0.0 0.0 2.0 2.0 32 5.0 6.0 1.866833 0.171200 10.904400 0.031517 0.174671 Cash_loans F Y N Unaccompanied Working Secondary_/_secondary_special Civil_marriage House_/_apartment TUESDAY Self-employed block_of_flats Panel No Credit_card
25343 456223 1 202500.0 315000.0 33205.5 315000.0 0.026392 -15922 -3037.000000 -2681.0 -1504 1 1 0 1 1 0 3.0 2 2 12 0 0 0 0 0 0 0.632770 0.283712 0.1113 0.1364 0.9955 0.16 0.13790 0.3333 0.0368 0.13830 0.054200 0.1134 0.141500 0.9955 0.1611 0.13790 0.3333 0.0376 0.1441 0.057400 0.1124 0.1364 0.9955 0.16 0.13790 0.3333 0.0374 0.14080 0.055400 0.1663 0.0 0.0 0.0 0.0 -838.0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0.0 0.0 0.0 0.0 3.0 1.0 44 5.0 2.0 1.555556 0.163978 9.486380 0.190742 0.733503 Cash_loans F Y Y Unaccompanied Commercial_associate Secondary_/_secondary_special Married House_/_apartment WEDNESDAY Business_Entity_Type_3 block_of_flats Stone,_brick No Consumer_credit
25344 456224 0 225000.0 450000.0 25128.0 450000.0 0.018850 -13968 -2731.000000 -1461.0 -1364 1 1 1 1 1 0 2.0 2 2 10 0 1 1 0 1 1 0.445701 0.595456 0.1629 0.0723 0.9896 0.16 0.06900 0.6250 0.0368 0.15630 0.149000 0.1660 0.075000 0.9896 0.1611 0.06900 0.6250 0.0376 0.1204 0.157700 0.1645 0.0723 0.9896 0.16 0.06900 0.6250 0.0374 0.15910 0.152100 0.1974 0.0 0.0 0.0 0.0 -2308.0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0.0 0.0 0.0 0.0 0.0 2.0 38 17.0 5.0 2.000000 0.111680 17.908309 0.195518 0.373090 Cash_loans M N N Family Commercial_associate Higher_education Married House_/_apartment MONDAY Self-employed block_of_flats Panel No Consumer_credit

25345 rows × 111 columns

In [ ]:
# Reset the index and discard the old one in a single step
app_test_final = app_test_final.reset_index(drop=True)
In [ ]:
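# Cast the identifier and the target to object so that the numeric
# selection used for standardisation below leaves them untouched
app_train_final['SK_ID_CURR'] = app_train_final['SK_ID_CURR'].astype('object')
app_train_final['TARGET'] = app_train_final['TARGET'].astype('object')
app_test_final['SK_ID_CURR'] = app_test_final['SK_ID_CURR'].astype('object')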

Effect of imputation on DAYS_EMPLOYED

In [ ]:
#app_train_final['DAYS_EMPLOYED'].replace(np.nan, app_train_final['DAYS_EMPLOYED'].mean(), inplace = True)
#app_test_final['DAYS_EMPLOYED'].replace(np.nan, app_test_final['DAYS_EMPLOYED'].mean(), inplace = True)
In [ ]:
plt.figure(figsize=(15, 10))
# Note: sns.distplot is deprecated since seaborn 0.11 (histplot/displot replace it).
# Passing label= ties each legend entry to its own plot instead of relying on artist order.
sns.distplot(app_train_final['DAYS_EMPLOYED'], hist=True, rug=True, bins=25, label='Train')
sns.distplot(app_test_final['DAYS_EMPLOYED'], hist=True, rug=True, bins=25, label='Test')
plt.title('Histogram of DAYS_EMPLOYED after replacing NaN with the variable mean (train and test sets)',
          weight='bold', size=18)
plt.xlabel('Days employed', weight='bold')
plt.legend()
plt.show()

Standardisation (Train + Test)

In [ ]:
df_num_train = app_train_final.select_dtypes(['number']).reset_index(drop = True)
df_categ_train = app_train_final.select_dtypes('object').reset_index(drop = True)

df_num_test = app_test_final.select_dtypes(['number']).reset_index(drop = True)
df_categ_test = app_test_final.select_dtypes('object').reset_index(drop = True)
In [ ]:
from sklearn.preprocessing import StandardScaler

# Fit the scaler on the training set only, so that no test-set statistic leaks into it
scaler = StandardScaler()
scaler.fit(df_num_train)

# Apply the same (training) mean and standard deviation to both sets
df_num_train_train = pd.DataFrame(scaler.transform(df_num_train), index=df_num_train.index, columns=df_num_train.columns)
df_num_test_test = pd.DataFrame(scaler.transform(df_num_test), index=df_num_test.index, columns=df_num_test.columns)
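
As a quick sanity check of the standardisation step, here is a minimal sketch on invented data (the `toy_*` names and all values are assumptions for illustration, not project data). It shows that a scaler fitted on the training set alone gives the training columns zero mean and unit variance, whereas the test columns are merely shifted and scaled by the training statistics.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data standing in for the numeric train/test frames
toy_train = pd.DataFrame({"income": [50.0, 100.0, 150.0, 200.0],
                          "age": [20.0, 30.0, 40.0, 50.0]})
toy_test = pd.DataFrame({"income": [100.0, 300.0],
                         "age": [25.0, 65.0]})

# Fit on the training set only, then transform both sets
toy_scaler = StandardScaler().fit(toy_train)
toy_train_std = pd.DataFrame(toy_scaler.transform(toy_train), columns=toy_train.columns)
toy_test_std = pd.DataFrame(toy_scaler.transform(toy_test), columns=toy_test.columns)

# Training columns: mean 0 and (population) standard deviation 1
print(toy_train_std.mean().round(6))
print(toy_train_std.std(ddof=0).round(6))

# Test columns: generally not centred, since they use the training statistics
print(toy_test_std.mean().round(3))
```

A test mean far from zero is not an error in itself; it only indicates that the test distribution differs from the training one for that variable.
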
In [ ]:
df_app_train = pd.concat([df_categ_train, df_num_train_train], axis=1)
df_app_test = pd.concat([df_categ_test, df_num_test_test], axis=1)
In [ ]:
df_app_train
Out[ ]:
SK_ID_CURR TARGET NAME_CONTRACT_TYPE CODE_GENDER FLAG_OWN_CAR FLAG_OWN_REALTY NAME_TYPE_SUITE NAME_INCOME_TYPE NAME_EDUCATION_TYPE NAME_FAMILY_STATUS NAME_HOUSING_TYPE WEEKDAY_APPR_PROCESS_START ORGANIZATION_TYPE HOUSETYPE_MODE WALLSMATERIAL_MODE EMERGENCYSTATE_MODE MOST_CREDIT_TYPE CNT_CHILDREN AMT_INCOME_TOTAL AMT_CREDIT AMT_ANNUITY AMT_GOODS_PRICE REGION_POPULATION_RELATIVE DAYS_BIRTH DAYS_EMPLOYED DAYS_REGISTRATION DAYS_ID_PUBLISH FLAG_MOBIL FLAG_EMP_PHONE FLAG_WORK_PHONE FLAG_CONT_MOBILE FLAG_PHONE FLAG_EMAIL CNT_FAM_MEMBERS REGION_RATING_CLIENT REGION_RATING_CLIENT_W_CITY HOUR_APPR_PROCESS_START REG_REGION_NOT_LIVE_REGION REG_REGION_NOT_WORK_REGION LIVE_REGION_NOT_WORK_REGION REG_CITY_NOT_LIVE_CITY REG_CITY_NOT_WORK_CITY LIVE_CITY_NOT_WORK_CITY EXT_SOURCE_2 EXT_SOURCE_3 APARTMENTS_AVG BASEMENTAREA_AVG YEARS_BEGINEXPLUATATION_AVG ELEVATORS_AVG ENTRANCES_AVG FLOORSMAX_AVG LANDAREA_AVG LIVINGAREA_AVG NONLIVINGAREA_AVG APARTMENTS_MODE BASEMENTAREA_MODE YEARS_BEGINEXPLUATATION_MODE ELEVATORS_MODE ENTRANCES_MODE FLOORSMAX_MODE LANDAREA_MODE LIVINGAREA_MODE NONLIVINGAREA_MODE APARTMENTS_MEDI BASEMENTAREA_MEDI YEARS_BEGINEXPLUATATION_MEDI ELEVATORS_MEDI ENTRANCES_MEDI FLOORSMAX_MEDI LANDAREA_MEDI LIVINGAREA_MEDI NONLIVINGAREA_MEDI TOTALAREA_MODE OBS_30_CNT_SOCIAL_CIRCLE DEF_30_CNT_SOCIAL_CIRCLE OBS_60_CNT_SOCIAL_CIRCLE DEF_60_CNT_SOCIAL_CIRCLE DAYS_LAST_PHONE_CHANGE FLAG_DOCUMENT_2 FLAG_DOCUMENT_3 FLAG_DOCUMENT_4 FLAG_DOCUMENT_5 FLAG_DOCUMENT_6 FLAG_DOCUMENT_7 FLAG_DOCUMENT_8 FLAG_DOCUMENT_9 FLAG_DOCUMENT_10 FLAG_DOCUMENT_11 FLAG_DOCUMENT_12 FLAG_DOCUMENT_13 FLAG_DOCUMENT_14 FLAG_DOCUMENT_15 FLAG_DOCUMENT_16 FLAG_DOCUMENT_17 FLAG_DOCUMENT_18 FLAG_DOCUMENT_19 FLAG_DOCUMENT_20 FLAG_DOCUMENT_21 AMT_REQ_CREDIT_BUREAU_HOUR AMT_REQ_CREDIT_BUREAU_DAY AMT_REQ_CREDIT_BUREAU_WEEK AMT_REQ_CREDIT_BUREAU_MON AMT_REQ_CREDIT_BUREAU_QRT AMT_REQ_CREDIT_BUREAU_YEAR AGE PREVIOUS_APPLICATION_COUNT PREVIOUS_LOANS_COUNT CREDIT_PERCENT_INCOME ANNUITY_CREDIT_PERCENT_INCOME 
CREDIT_REFUND_TIME DAYS_EMPLOYED_PERCENT EXT_SOURCE_1
0 100002 1.0 Cash_loans M N Y Unaccompanied Working Secondary_/_secondary_special Single_/_not_married House_/_apartment WEDNESDAY Business_Entity_Type_3 block_of_flats Stone,_brick No Consumer_credit -0.573222 0.069058 -0.513860 -0.222911 -0.544493 -0.226831 -1.521797 0.784979 0.435145 0.592401 0.002542 0.458151 -0.46321 0.047807 1.490005 -0.267116 -1.234433 0.032789 0.100683 -0.691452 -0.110786 -0.220085 -0.2025 -0.193282 -0.370518 -0.338066 -1.486774 -2.015710 -0.865735 -0.653469 -0.119336 -0.597191 -0.811059 -0.994768 -0.380311 -0.809814 -0.419175 -0.833449 -0.610656 -0.095632 -0.573400 -0.758313 -0.973308 -0.349750 -0.779994 -0.395464 -0.860099 -0.649427 -0.117949 -0.591143 -0.803313 -0.989141 -0.378838 -0.805722 -0.413749 -0.823533 0.251202 4.215860 0.259889 5.334693 -0.171217 -0.006725 0.660145 -0.00843 -0.124743 -0.305616 -0.014821 -0.310105 -0.069586 -0.005683 -0.068832 -0.003594 -0.066765 -0.06598 -0.038405 -0.110307 -0.018852 -0.095185 -0.026299 -0.025168 -0.018852 -0.082491 -0.067154 -0.172916 -0.309610 -0.299344 -0.487533 -1.514272 0.505080 -0.944343 -0.703497 -0.579286 -0.663956 0.707019 -2.378372
1 100003 0.0 Cash_loans F N N Family State_servant Higher_education Married House_/_apartment MONDAY School block_of_flats Block No Credit_card -0.573222 0.282931 1.597394 0.500488 1.467988 -1.206771 0.151362 0.546259 1.116256 1.801065 0.002542 0.458151 -0.46321 0.047807 1.490005 -0.267116 -0.108965 -1.821758 -1.805455 -0.390994 -0.110786 -0.220085 -0.2025 -0.193282 -0.370518 -0.338066 0.453341 -1.915166 -0.204086 -0.450753 0.124243 0.008392 -1.157125 0.449491 -0.688567 -0.483594 -0.274421 -0.207225 -0.418553 0.123118 0.047529 -1.101191 0.480237 -0.668697 -0.460213 -0.395464 -0.198015 -0.446030 0.122137 0.014939 -1.148270 0.450812 -0.688457 -0.479085 -0.267304 -0.299918 -0.156823 -0.313846 -0.151666 -0.269338 0.191540 -0.006725 0.660145 -0.00843 -0.124743 -0.305616 -0.014821 -0.310105 -0.069586 -0.005683 -0.068832 -0.003594 -0.066765 -0.06598 -0.038405 -0.110307 -0.018852 -0.095185 -0.026299 -0.025168 -0.018852 -0.082491 -0.067154 -0.172916 -0.309610 -0.299344 -1.036331 0.156858 -0.407513 -0.465422 0.369009 -0.466967 1.833934 0.679968 -1.139076
2 100016 0.0 Cash_loans F N Y Unaccompanied Working Secondary_/_secondary_special Married House_/_apartment FRIDAY Business_Entity_Type_2 block_of_flats Panel No Consumer_credit -0.573222 -0.358688 -1.289258 -1.460739 -1.277362 0.577670 -0.610539 -0.116181 1.358324 -0.139141 0.002542 0.458151 2.15885 0.047807 1.490005 -0.267116 -0.108965 0.032789 0.100683 -0.691452 -0.110786 -0.220085 -0.2025 -0.193282 -0.370518 -0.338066 0.954416 -1.814623 -0.328610 -0.169484 0.048715 -0.597191 0.572202 -0.416788 -0.682118 -0.275504 -0.419175 -0.285503 -0.126060 0.055288 -0.573400 0.612204 -0.391611 -0.655888 -0.230258 -0.395464 -0.322501 -0.163817 0.047692 -0.591143 0.575514 -0.412883 -0.682086 -0.269679 -0.413749 -0.394447 -0.564847 -0.313846 -0.563222 -0.269338 -1.636472 -0.006725 0.660145 -0.00843 -0.124743 -0.305616 -0.014821 -0.310105 -0.069586 -0.005683 -0.068832 -0.003594 -0.066765 -0.06598 -0.038405 -0.110307 -0.018852 -0.095185 -0.026299 -0.025168 -0.018852 -0.082491 -0.067154 -0.172916 0.671071 -0.299344 -1.036331 -0.595150 0.276932 -0.225961 -1.015625 -0.961526 -1.006566 -0.325466 -0.305222
3 100017 0.0 Cash_loans M Y N Unaccompanied Working Secondary_/_secondary_special Married House_/_apartment THURSDAY Self-employed block_of_flats Panel No Consumer_credit 0.890172 0.140349 0.704634 0.057687 0.351236 -0.367401 -0.462328 -0.250921 1.266477 -1.251984 0.002542 0.458151 -0.46321 0.047807 -0.671139 -0.267116 1.016503 0.032789 0.100683 0.209923 -0.110786 -0.220085 -0.2025 -0.193282 -0.370518 -0.338066 0.154523 1.386615 0.274495 0.111785 0.039274 0.613975 -0.119930 0.737788 0.344542 0.286976 -0.419175 0.331402 0.166433 0.046810 0.667688 -0.073551 0.770388 0.386773 0.350019 -0.395464 0.282411 0.118396 0.038386 0.621021 -0.114400 0.738250 0.349977 0.294105 -0.413749 0.351589 -0.564847 -0.313846 -0.563222 -0.269338 1.168377 -0.006725 0.660145 -0.00843 -0.124743 -0.305616 -0.014821 -0.310105 -0.069586 -0.005683 -0.068832 -0.003594 -0.066765 -0.06598 -0.038405 -0.110307 -0.018852 -0.095185 -0.026299 -0.025168 -0.018852 -0.082491 -0.067154 -0.172916 -0.309610 -0.299344 -0.487533 -0.428038 0.048784 -0.704883 0.095890 -0.505105 1.262153 -0.423416 0.392831
4 100018 0.0 Cash_loans F N Y Unaccompanied Working Secondary_/_secondary_special Married House_/_apartment MONDAY Transport:_type_2 block_of_flats Panel No Consumer_credit -0.573222 0.026283 0.359971 0.308390 0.304704 -0.791613 -0.348478 0.973010 1.274223 0.634695 0.002542 0.458151 -0.46321 0.047807 -0.671139 -0.267116 -0.108965 0.032789 -1.805455 -0.991910 -0.110786 -0.220085 -0.2025 -0.193282 -0.370518 -0.338066 0.563552 0.811138 2.152576 0.570430 0.377263 2.430723 0.226136 3.048325 1.411184 2.446935 1.059385 2.250145 0.632438 0.350346 2.529705 0.269326 3.095780 1.470422 2.574112 1.154918 2.163543 0.578582 0.371529 2.439266 0.230557 3.041898 1.422813 2.459752 1.082916 2.570233 -0.564847 -0.313846 -0.563222 -0.269338 0.950249 -0.006725 0.660145 -0.00843 -0.124743 -0.305616 -0.014821 -0.310105 -0.069586 -0.005683 -0.068832 -0.003594 -0.066765 -0.06598 -0.038405 -0.110307 -0.018852 -0.095185 -0.026299 -0.025168 -0.018852 -0.082491 -0.067154 -0.172916 -0.309610 -0.299344 -0.761932 -0.344481 -0.407513 -0.225961 0.100310 -0.014877 0.238351 1.115964 1.090884
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
154802 456251 0.0 Cash_loans M N N Unaccompanied Working Secondary_/_secondary_special Separated With_parents THURSDAY Services block_of_flats Stone,_brick No Consumer_credit -0.573222 -0.073524 -0.875448 -0.034957 -0.870213 0.656785 -1.552493 0.958713 -0.894985 0.683596 0.002542 0.458151 -0.46321 0.047807 -0.671139 -0.267116 -1.234433 -1.821758 -1.805455 0.810840 -0.110786 -0.220085 -0.2025 -0.193282 -0.370518 -0.338066 0.774014 0.295073 0.782813 0.002825 0.171448 1.068162 -0.465996 2.615186 -0.090112 0.803112 1.198231 -0.128947 -0.872165 0.006112 0.047529 -1.101191 1.642235 -0.712248 -0.191633 -0.212636 0.790501 0.009070 0.168666 1.075582 -0.459356 2.610051 -0.085783 0.812250 1.223503 1.724108 -0.564847 -0.313846 -0.563222 -0.269338 0.849483 -0.006725 -1.514819 -0.00843 -0.124743 -0.305616 -0.014821 3.224716 -0.069586 -0.005683 -0.068832 -0.003594 -0.066765 -0.06598 -0.038405 -0.110307 -0.018852 -0.095185 -0.026299 -0.025168 -0.018852 3.848012 -0.067154 -0.172916 1.324858 -0.299344 -0.853399 -1.514272 0.961376 -0.944343 -0.854089 0.002048 -1.575893 1.028809 -2.038814
154803 456252 0.0 Cash_loans F N Y Unaccompanied Pensioner Secondary_/_secondary_special Widow House_/_apartment MONDAY Business_Entity_Type_3 block_of_flats Stone,_brick No Consumer_credit -0.573222 -0.344430 -0.840098 -1.058193 -0.870213 0.181777 1.069950 -0.706050 0.230425 -0.709440 0.002542 -2.182687 -0.46321 0.047807 1.490005 -0.267116 -1.234433 0.032789 0.100683 -1.292369 -0.110786 -0.220085 -0.2025 -0.193282 -0.370518 -0.338066 -2.280302 -0.645935 -0.865735 -0.569849 -0.109895 -0.597191 -0.465996 -0.994768 -0.109458 -0.748932 -0.419175 -0.833449 -0.526379 -0.087154 -0.573400 -0.416429 -0.973308 -0.074355 -0.718014 -0.395464 -0.860099 -0.565526 -0.108644 -0.591143 -0.459356 -0.989141 -0.106169 -0.744869 -0.413749 -0.763294 -0.564847 -0.313846 -0.563222 -0.269338 1.173119 -0.006725 0.660145 -0.00843 -0.124743 -0.305616 -0.014821 -0.310105 -0.069586 -0.005683 -0.068832 -0.003594 -0.066765 -0.06598 -0.038405 -0.110307 -0.018852 -0.095185 -0.026299 -0.025168 -0.018852 7.778516 -0.067154 -0.172916 0.997965 -0.299344 -0.670466 1.075979 0.276932 -0.944343 -0.034501 -0.088826 0.093836 -0.900589 -0.414000
154804 456253 0.0 Cash_loans F N Y Unaccompanied Working Higher_education Separated House_/_apartment THURSDAY School block_of_flats Panel No Consumer_credit -0.573222 -0.087782 0.131407 0.124285 0.060415 -1.112951 -0.260742 -2.370813 -0.419425 -1.409923 0.002542 0.458151 -0.46321 0.047807 -0.671139 3.743692 -1.234433 1.887336 2.006822 -0.991910 -0.110786 -0.220085 -0.2025 -0.193282 2.698927 2.958003 -0.013867 -1.586943 -0.137177 -0.028850 0.058156 -0.597191 0.572202 -0.416788 -0.109458 7.449279 -0.419175 -0.089808 0.022665 0.063767 -0.573400 0.612204 -0.391611 -0.074355 7.725638 -0.395464 -0.130700 -0.022710 0.056997 -0.591143 0.575514 -0.412883 -0.106169 7.473853 -0.413749 6.424595 1.883299 -0.313846 1.906110 -0.269338 -1.089965 -0.006725 0.660145 -0.00843 -0.124743 -0.305616 -0.014821 -0.310105 -0.069586 -0.005683 -0.068832 -0.003594 -0.066765 -0.06598 -0.038405 -0.110307 -0.018852 -0.095185 -0.026299 -0.025168 -0.018852 11.709019 -0.067154 -0.172916 0.671071 -0.299344 -0.487533 -0.260925 -0.407513 -0.704883 0.229660 0.232084 0.112145 -2.829987 1.210815
154805 456254 1.0 Cash_loans F N Y Unaccompanied Commercial_associate Secondary_/_secondary_special Married House_/_apartment WEDNESDAY Business_Entity_Type_1 block_of_flats Stone,_brick No Consumer_credit -0.573222 -0.030750 -0.600725 -0.518604 -0.625923 -1.092980 -0.949111 -1.012575 0.735586 1.378132 0.002542 0.458151 -0.46321 0.047807 -0.671139 -0.267116 -0.108965 0.032789 0.100683 -0.991910 -0.110786 -0.220085 -0.2025 5.173787 2.698927 -0.338066 -0.130280 0.798280 -0.980037 -0.241702 -0.026814 -0.294400 -0.811059 -1.283066 -0.109458 -0.927036 -0.419175 -0.950866 -0.192987 -0.010845 -0.262935 -0.758313 -1.263459 -0.074355 -0.901259 -0.395464 -0.975365 -0.236277 -0.026754 -0.288102 -0.803313 -1.276579 -0.106169 -0.922953 -0.413749 -0.881918 -0.564847 -0.313846 -0.563222 -0.269338 0.791394 -0.006725 0.660145 -0.00843 -0.124743 -0.305616 -0.014821 -0.310105 -0.069586 -0.005683 -0.068832 -0.003594 -0.066765 -0.06598 -0.038405 -0.110307 -0.018852 -0.095185 -0.026299 -0.025168 -0.018852 -0.082491 -0.067154 -0.172916 -0.309610 -0.299344 -1.036331 -0.929376 -1.091957 -0.704883 -0.643190 -0.621190 -0.429428 -1.841232 1.184841
154806 456255 0.0 Cash_loans F N N Unaccompanied Commercial_associate Higher_education Married House_/_apartment THURSDAY Business_Entity_Type_3 block_of_flats Panel No Consumer_credit -0.573222 -0.073524 0.125065 1.383128 0.293071 1.533914 0.172208 0.514198 0.025704 1.722426 0.002542 0.458151 2.15885 0.047807 1.490005 -0.267116 -0.108965 -1.821758 -1.805455 2.313132 -0.110786 -0.220085 -0.2025 -0.193282 2.698927 2.958003 0.919464 -2.153016 -0.405740 -0.454554 0.180889 0.008392 -0.811059 1.026778 -0.109458 -0.263691 -0.419175 -0.363781 -0.408638 0.173990 0.047529 -0.758313 1.061236 -0.074355 -0.217682 -0.395464 -0.399960 -0.449844 0.177971 0.014939 -0.803313 1.026379 -0.106169 -0.258046 -0.413749 -0.296211 -0.564847 -0.313846 -0.563222 -0.269338 0.240145 -0.006725 0.660145 -0.00843 -0.124743 -0.305616 -0.014821 -0.310105 -0.069586 -0.005683 -0.068832 -0.003594 -0.066765 -0.06598 -0.038405 -0.110307 -0.018852 -0.095185 -0.026299 -0.025168 -0.018852 -0.082491 -0.067154 -0.172916 1.651752 -0.299344 -0.487533 0.156858 1.189524 0.731883 0.174370 1.503669 -1.007386 0.649283 1.158867

154807 rows × 112 columns

In [ ]:
df_app_train.shape
Out[ ]:
(154807, 112)
In [ ]:
df_app_test.shape
Out[ ]:
(25345, 111)

Encoding (Train + Test)

With Standardisation

In [ ]:
df_app_train.shape
Out[ ]:
(154807, 112)

For categorical variables with many classes, one-hot encoding is the safest approach because it does not impose arbitrary values on the categories. Its only drawback is that the number of features (the dimensionality of the data) can explode when categorical variables have many categories.
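
The trade-off can be seen on a tiny invented column (a minimal sketch; the five values below are made up, only the column name comes from the dataset). Label encoding maps categories to integers, which suggests an order that does not exist, while one-hot encoding adds one column per category.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical sample of a multi-class categorical variable
toy = pd.DataFrame({"ORGANIZATION_TYPE": ["School", "Bank", "Police", "Bank", "Hotel"]})

# Label encoding: classes are sorted, so Bank=0, Hotel=1, Police=2, School=3.
# A model may read this as "School > Police > Hotel > Bank", which is arbitrary.
codes = LabelEncoder().fit_transform(toy["ORGANIZATION_TYPE"])
print(codes)

# One-hot encoding: no order is implied, but one column is created per category
dummies = pd.get_dummies(toy, columns=["ORGANIZATION_TYPE"], dtype=int)
print(dummies.columns.tolist())
print(dummies.shape)
```

With 4 distinct categories the single column becomes 4 columns; a variable with dozens of categories grows accordingly.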

Let us implement the policy described above: for any categorical variable (dtype == object) with 2 unique categories or fewer, we use label encoding, and for any categorical variable with more than 2 unique categories, we use one-hot encoding.
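
Before encoding, it can help to preview which columns fall on each side of this rule. The sketch below does so on an invented frame (the values and the `toy`, `label_cols` and `onehot_cols` names are assumptions for illustration; the column names are borrowed from the dataset).

```python
import pandas as pd

# Hypothetical toy frame: three categorical columns and one numeric column
toy = pd.DataFrame({
    "CODE_GENDER": ["M", "F", "F"],
    "FLAG_OWN_CAR": ["N", "Y", "N"],
    "NAME_INCOME_TYPE": ["Working", "Pensioner", "State_servant"],
    "AMT_CREDIT": [1.0, 2.0, 3.0],
})
categorical_cols = ["CODE_GENDER", "FLAG_OWN_CAR", "NAME_INCOME_TYPE"]

# Number of distinct values per categorical column (NaN counted as a value)
n_categories = toy[categorical_cols].nunique(dropna=False)

# 2 categories or fewer: label encoding; more than 2: one-hot encoding
label_cols = n_categories[n_categories <= 2].index.tolist()
onehot_cols = n_categories[n_categories > 2].index.tolist()

print("Label encoding:", label_cols)
print("One-hot encoding:", onehot_cols)
```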

In [ ]:
# Create a label encoder object
le = LabelEncoder()
le_count = 0
oh_count = 0

# Iterate through the columns (the loop walks the original column list,
# so the one-hot columns created along the way are not revisited)
for col in df_app_train:
    # The target and the client identifier are not features
    if col in ('TARGET', 'SK_ID_CURR'):
        continue
    if df_app_train[col].dtype == 'object':
        # 2 or fewer unique categories: label encoding
        if len(df_app_train[col].unique()) <= 2:
            # Fit on the training data only
            le.fit(df_app_train[col])
            # Transform both training and testing data
            # (raises a ValueError if the test set holds a category unseen in training)
            df_app_train[col] = le.transform(df_app_train[col])
            df_app_test[col] = le.transform(df_app_test[col])
            # Keep track of how many columns were label encoded
            le_count += 1
        else:
            # More than 2 unique categories: one-hot encoding
            df_app_train = pd.get_dummies(df_app_train, prefix=[col], columns=[col])
            df_app_test = pd.get_dummies(df_app_test, prefix=[col], columns=[col])
            # Keep track of how many columns were one-hot encoded
            oh_count += 1

print('%d columns were label encoded.' % le_count)
print('Training Features shape: ', df_app_train.shape)
print('Testing Features shape: ', df_app_test.shape)
print('%d columns were one hot encoded.' % oh_count)
5 columns were label encoded.
Training Features shape:  (154807, 217)
Testing Features shape:  (25345, 211)
10 columns were one hot encoded.

The training and test data must contain the same features (columns). One-hot encoding created more columns in the training data (217, TARGET included, against 211 in the test data) because some categorical variables have categories that never appear in the test data. To remove the training columns that are absent from the test data, we align the two dataframes. We first extract the target column from the training data (it is not in the test data, but we need to keep it). When aligning, we must set axis = 1 so that the dataframes are aligned on columns rather than on rows.

In [ ]:
train_labels = df_app_train['TARGET']

# Align the training and testing data, keep only columns present in both dataframes
df_app_train, df_app_test = df_app_train.align(df_app_test, join = 'inner', axis = 1)

# Add the target back in
df_app_train['TARGET'] = train_labels

print('Training Features shape: ', df_app_train.shape)
print('Testing Features shape: ', df_app_test.shape)
Training Features shape:  (154807, 212)
Testing Features shape:  (25345, 211)
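
The alignment step can be illustrated on a toy example (a minimal sketch; the `HOUSING` column, its values and the `toy_*` names are invented for illustration). The training frame contains a category that the test frame lacks, so one-hot encoding yields an extra training column, which the inner alignment then drops.

```python
import pandas as pd

# Hypothetical toy data: "Co-op" only exists in the training set
toy_train = pd.DataFrame({"TARGET": [0, 1, 0],
                          "HOUSING": ["House", "Rented", "Co-op"]})
toy_test = pd.DataFrame({"HOUSING": ["House", "Rented"]})

# One-hot encode each frame separately, as in the notebook
toy_train_enc = pd.get_dummies(toy_train, columns=["HOUSING"], dtype=int)
toy_test_enc = pd.get_dummies(toy_test, columns=["HOUSING"], dtype=int)

# Save the target, keep only the columns shared by both frames, restore the target
toy_labels = toy_train_enc["TARGET"]
toy_train_enc, toy_test_enc = toy_train_enc.align(toy_test_enc, join="inner", axis=1)
toy_train_enc["TARGET"] = toy_labels

print(sorted(toy_train_enc.columns))
print(sorted(toy_test_enc.columns))
```

Note that an inner join would also drop a column present only in the test frame; in the notebook the test shape stays at 211 columns, so only training-only columns were removed.
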
In [ ]:
df_app_train
Out[ ]:
SK_ID_CURR NAME_CONTRACT_TYPE CODE_GENDER FLAG_OWN_CAR FLAG_OWN_REALTY EMERGENCYSTATE_MODE CNT_CHILDREN AMT_INCOME_TOTAL AMT_CREDIT AMT_ANNUITY AMT_GOODS_PRICE REGION_POPULATION_RELATIVE DAYS_BIRTH DAYS_EMPLOYED DAYS_REGISTRATION DAYS_ID_PUBLISH FLAG_MOBIL FLAG_EMP_PHONE FLAG_WORK_PHONE FLAG_CONT_MOBILE FLAG_PHONE FLAG_EMAIL CNT_FAM_MEMBERS REGION_RATING_CLIENT REGION_RATING_CLIENT_W_CITY HOUR_APPR_PROCESS_START REG_REGION_NOT_LIVE_REGION REG_REGION_NOT_WORK_REGION LIVE_REGION_NOT_WORK_REGION REG_CITY_NOT_LIVE_CITY REG_CITY_NOT_WORK_CITY LIVE_CITY_NOT_WORK_CITY EXT_SOURCE_2 EXT_SOURCE_3 APARTMENTS_AVG BASEMENTAREA_AVG YEARS_BEGINEXPLUATATION_AVG ELEVATORS_AVG ENTRANCES_AVG FLOORSMAX_AVG LANDAREA_AVG LIVINGAREA_AVG NONLIVINGAREA_AVG APARTMENTS_MODE BASEMENTAREA_MODE YEARS_BEGINEXPLUATATION_MODE ELEVATORS_MODE ENTRANCES_MODE FLOORSMAX_MODE LANDAREA_MODE LIVINGAREA_MODE NONLIVINGAREA_MODE APARTMENTS_MEDI BASEMENTAREA_MEDI YEARS_BEGINEXPLUATATION_MEDI ELEVATORS_MEDI ENTRANCES_MEDI FLOORSMAX_MEDI LANDAREA_MEDI LIVINGAREA_MEDI NONLIVINGAREA_MEDI TOTALAREA_MODE OBS_30_CNT_SOCIAL_CIRCLE DEF_30_CNT_SOCIAL_CIRCLE OBS_60_CNT_SOCIAL_CIRCLE DEF_60_CNT_SOCIAL_CIRCLE DAYS_LAST_PHONE_CHANGE FLAG_DOCUMENT_2 FLAG_DOCUMENT_3 FLAG_DOCUMENT_4 FLAG_DOCUMENT_5 FLAG_DOCUMENT_6 FLAG_DOCUMENT_7 FLAG_DOCUMENT_8 FLAG_DOCUMENT_9 ... 
ORGANIZATION_TYPE_Agriculture ORGANIZATION_TYPE_Bank ORGANIZATION_TYPE_Business_Entity_Type_1 ORGANIZATION_TYPE_Business_Entity_Type_2 ORGANIZATION_TYPE_Business_Entity_Type_3 ORGANIZATION_TYPE_Cleaning ORGANIZATION_TYPE_Construction ORGANIZATION_TYPE_Culture ORGANIZATION_TYPE_Electricity ORGANIZATION_TYPE_Emergency ORGANIZATION_TYPE_Government ORGANIZATION_TYPE_Hotel ORGANIZATION_TYPE_Housing ORGANIZATION_TYPE_Industry:_type_1 ORGANIZATION_TYPE_Industry:_type_10 ORGANIZATION_TYPE_Industry:_type_11 ORGANIZATION_TYPE_Industry:_type_12 ORGANIZATION_TYPE_Industry:_type_13 ORGANIZATION_TYPE_Industry:_type_2 ORGANIZATION_TYPE_Industry:_type_3 ORGANIZATION_TYPE_Industry:_type_4 ORGANIZATION_TYPE_Industry:_type_5 ORGANIZATION_TYPE_Industry:_type_6 ORGANIZATION_TYPE_Industry:_type_7 ORGANIZATION_TYPE_Industry:_type_9 ORGANIZATION_TYPE_Insurance ORGANIZATION_TYPE_Kindergarten ORGANIZATION_TYPE_Legal_Services ORGANIZATION_TYPE_Medicine ORGANIZATION_TYPE_Military ORGANIZATION_TYPE_Mobile ORGANIZATION_TYPE_Other ORGANIZATION_TYPE_Police ORGANIZATION_TYPE_Postal ORGANIZATION_TYPE_Realtor ORGANIZATION_TYPE_Religion ORGANIZATION_TYPE_Restaurant ORGANIZATION_TYPE_School ORGANIZATION_TYPE_Security ORGANIZATION_TYPE_Security_Ministries ORGANIZATION_TYPE_Self-employed ORGANIZATION_TYPE_Services ORGANIZATION_TYPE_Telecom ORGANIZATION_TYPE_Trade:_type_1 ORGANIZATION_TYPE_Trade:_type_2 ORGANIZATION_TYPE_Trade:_type_3 ORGANIZATION_TYPE_Trade:_type_4 ORGANIZATION_TYPE_Trade:_type_5 ORGANIZATION_TYPE_Trade:_type_6 ORGANIZATION_TYPE_Trade:_type_7 ORGANIZATION_TYPE_Transport:_type_1 ORGANIZATION_TYPE_Transport:_type_2 ORGANIZATION_TYPE_Transport:_type_3 ORGANIZATION_TYPE_Transport:_type_4 ORGANIZATION_TYPE_University HOUSETYPE_MODE_block_of_flats HOUSETYPE_MODE_specific_housing HOUSETYPE_MODE_terraced_house WALLSMATERIAL_MODE_Block WALLSMATERIAL_MODE_Mixed WALLSMATERIAL_MODE_Monolithic WALLSMATERIAL_MODE_Others WALLSMATERIAL_MODE_Panel WALLSMATERIAL_MODE_Stone,_brick 
WALLSMATERIAL_MODE_Wooden MOST_CREDIT_TYPE_Another_type_of_loan MOST_CREDIT_TYPE_Car_loan MOST_CREDIT_TYPE_Consumer_credit MOST_CREDIT_TYPE_Credit_card MOST_CREDIT_TYPE_Loan_for_business_development MOST_CREDIT_TYPE_Loan_for_working_capital_replenishment MOST_CREDIT_TYPE_Microloan MOST_CREDIT_TYPE_Mortgage MOST_CREDIT_TYPE_Unknown_type_of_loan TARGET
0 100002 0 1 0 1 0 -0.573222 0.069058 -0.513860 -0.222911 -0.544493 -0.226831 -1.521797 0.784979 0.435145 0.592401 0.002542 0.458151 -0.46321 0.047807 1.490005 -0.267116 -1.234433 0.032789 0.100683 -0.691452 -0.110786 -0.220085 -0.2025 -0.193282 -0.370518 -0.338066 -1.486774 -2.015710 -0.865735 -0.653469 -0.119336 -0.597191 -0.811059 -0.994768 -0.380311 -0.809814 -0.419175 -0.833449 -0.610656 -0.095632 -0.573400 -0.758313 -0.973308 -0.349750 -0.779994 -0.395464 -0.860099 -0.649427 -0.117949 -0.591143 -0.803313 -0.989141 -0.378838 -0.805722 -0.413749 -0.823533 0.251202 4.215860 0.259889 5.334693 -0.171217 -0.006725 0.660145 -0.00843 -0.124743 -0.305616 -0.014821 -0.310105 -0.069586 ... 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 1 0 0 0 1 0 0 0 0 0 0 1.0
1 100003 0 0 0 0 0 -0.573222 0.282931 1.597394 0.500488 1.467988 -1.206771 0.151362 0.546259 1.116256 1.801065 0.002542 0.458151 -0.46321 0.047807 1.490005 -0.267116 -0.108965 -1.821758 -1.805455 -0.390994 -0.110786 -0.220085 -0.2025 -0.193282 -0.370518 -0.338066 0.453341 -1.915166 -0.204086 -0.450753 0.124243 0.008392 -1.157125 0.449491 -0.688567 -0.483594 -0.274421 -0.207225 -0.418553 0.123118 0.047529 -1.101191 0.480237 -0.668697 -0.460213 -0.395464 -0.198015 -0.446030 0.122137 0.014939 -1.148270 0.450812 -0.688457 -0.479085 -0.267304 -0.299918 -0.156823 -0.313846 -0.151666 -0.269338 0.191540 -0.006725 0.660145 -0.00843 -0.124743 -0.305616 -0.014821 -0.310105 -0.069586 ... 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 1 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0.0
2 100016 0 0 0 1 0 -0.573222 -0.358688 -1.289258 -1.460739 -1.277362 0.577670 -0.610539 -0.116181 1.358324 -0.139141 0.002542 0.458151 2.15885 0.047807 1.490005 -0.267116 -0.108965 0.032789 0.100683 -0.691452 -0.110786 -0.220085 -0.2025 -0.193282 -0.370518 -0.338066 0.954416 -1.814623 -0.328610 -0.169484 0.048715 -0.597191 0.572202 -0.416788 -0.682118 -0.275504 -0.419175 -0.285503 -0.126060 0.055288 -0.573400 0.612204 -0.391611 -0.655888 -0.230258 -0.395464 -0.322501 -0.163817 0.047692 -0.591143 0.575514 -0.412883 -0.682086 -0.269679 -0.413749 -0.394447 -0.564847 -0.313846 -0.563222 -0.269338 -1.636472 -0.006725 0.660145 -0.00843 -0.124743 -0.305616 -0.014821 -0.310105 -0.069586 ... 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 1 0 0 0 0 0 0 0.0
3 100017 0 1 1 0 0 0.890172 0.140349 0.704634 0.057687 0.351236 -0.367401 -0.462328 -0.250921 1.266477 -1.251984 0.002542 0.458151 -0.46321 0.047807 -0.671139 -0.267116 1.016503 0.032789 0.100683 0.209923 -0.110786 -0.220085 -0.2025 -0.193282 -0.370518 -0.338066 0.154523 1.386615 0.274495 0.111785 0.039274 0.613975 -0.119930 0.737788 0.344542 0.286976 -0.419175 0.331402 0.166433 0.046810 0.667688 -0.073551 0.770388 0.386773 0.350019 -0.395464 0.282411 0.118396 0.038386 0.621021 -0.114400 0.738250 0.349977 0.294105 -0.413749 0.351589 -0.564847 -0.313846 -0.563222 -0.269338 1.168377 -0.006725 0.660145 -0.00843 -0.124743 -0.305616 -0.014821 -0.310105 -0.069586 ... 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 1 0 0 0 0 0 0 0.0
4 100018 0 0 0 1 0 -0.573222 0.026283 0.359971 0.308390 0.304704 -0.791613 -0.348478 0.973010 1.274223 0.634695 0.002542 0.458151 -0.46321 0.047807 -0.671139 -0.267116 -0.108965 0.032789 -1.805455 -0.991910 -0.110786 -0.220085 -0.2025 -0.193282 -0.370518 -0.338066 0.563552 0.811138 2.152576 0.570430 0.377263 2.430723 0.226136 3.048325 1.411184 2.446935 1.059385 2.250145 0.632438 0.350346 2.529705 0.269326 3.095780 1.470422 2.574112 1.154918 2.163543 0.578582 0.371529 2.439266 0.230557 3.041898 1.422813 2.459752 1.082916 2.570233 -0.564847 -0.313846 -0.563222 -0.269338 0.950249 -0.006725 0.660145 -0.00843 -0.124743 -0.305616 -0.014821 -0.310105 -0.069586 ... 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 1 0 0 0 0 0 0 0.0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
154802 456251 0 1 0 0 0 -0.573222 -0.073524 -0.875448 -0.034957 -0.870213 0.656785 -1.552493 0.958713 -0.894985 0.683596 0.002542 0.458151 -0.46321 0.047807 -0.671139 -0.267116 -1.234433 -1.821758 -1.805455 0.810840 -0.110786 -0.220085 -0.2025 -0.193282 -0.370518 -0.338066 0.774014 0.295073 0.782813 0.002825 0.171448 1.068162 -0.465996 2.615186 -0.090112 0.803112 1.198231 -0.128947 -0.872165 0.006112 0.047529 -1.101191 1.642235 -0.712248 -0.191633 -0.212636 0.790501 0.009070 0.168666 1.075582 -0.459356 2.610051 -0.085783 0.812250 1.223503 1.724108 -0.564847 -0.313846 -0.563222 -0.269338 0.849483 -0.006725 -1.514819 -0.00843 -0.124743 -0.305616 -0.014821 3.224716 -0.069586 ... 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 1 0 0 0 1 0 0 0 0 0 0 0.0
154803 456252 0 0 0 1 0 -0.573222 -0.344430 -0.840098 -1.058193 -0.870213 0.181777 1.069950 -0.706050 0.230425 -0.709440 0.002542 -2.182687 -0.46321 0.047807 1.490005 -0.267116 -1.234433 0.032789 0.100683 -1.292369 -0.110786 -0.220085 -0.2025 -0.193282 -0.370518 -0.338066 -2.280302 -0.645935 -0.865735 -0.569849 -0.109895 -0.597191 -0.465996 -0.994768 -0.109458 -0.748932 -0.419175 -0.833449 -0.526379 -0.087154 -0.573400 -0.416429 -0.973308 -0.074355 -0.718014 -0.395464 -0.860099 -0.565526 -0.108644 -0.591143 -0.459356 -0.989141 -0.106169 -0.744869 -0.413749 -0.763294 -0.564847 -0.313846 -0.563222 -0.269338 1.173119 -0.006725 0.660145 -0.00843 -0.124743 -0.305616 -0.014821 -0.310105 -0.069586 ... 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 1 0 0 0 1 0 0 0 0 0 0 0.0
154804 456253 0 0 0 1 0 -0.573222 -0.087782 0.131407 0.124285 0.060415 -1.112951 -0.260742 -2.370813 -0.419425 -1.409923 0.002542 0.458151 -0.46321 0.047807 -0.671139 3.743692 -1.234433 1.887336 2.006822 -0.991910 -0.110786 -0.220085 -0.2025 -0.193282 2.698927 2.958003 -0.013867 -1.586943 -0.137177 -0.028850 0.058156 -0.597191 0.572202 -0.416788 -0.109458 7.449279 -0.419175 -0.089808 0.022665 0.063767 -0.573400 0.612204 -0.391611 -0.074355 7.725638 -0.395464 -0.130700 -0.022710 0.056997 -0.591143 0.575514 -0.412883 -0.106169 7.473853 -0.413749 6.424595 1.883299 -0.313846 1.906110 -0.269338 -1.089965 -0.006725 0.660145 -0.00843 -0.124743 -0.305616 -0.014821 -0.310105 -0.069586 ... 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 1 0 0 0 0 0 0 0.0
154805 456254 0 0 0 1 0 -0.573222 -0.030750 -0.600725 -0.518604 -0.625923 -1.092980 -0.949111 -1.012575 0.735586 1.378132 0.002542 0.458151 -0.46321 0.047807 -0.671139 -0.267116 -0.108965 0.032789 0.100683 -0.991910 -0.110786 -0.220085 -0.2025 5.173787 2.698927 -0.338066 -0.130280 0.798280 -0.980037 -0.241702 -0.026814 -0.294400 -0.811059 -1.283066 -0.109458 -0.927036 -0.419175 -0.950866 -0.192987 -0.010845 -0.262935 -0.758313 -1.263459 -0.074355 -0.901259 -0.395464 -0.975365 -0.236277 -0.026754 -0.288102 -0.803313 -1.276579 -0.106169 -0.922953 -0.413749 -0.881918 -0.564847 -0.313846 -0.563222 -0.269338 0.791394 -0.006725 0.660145 -0.00843 -0.124743 -0.305616 -0.014821 -0.310105 -0.069586 ... 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 1 0 0 0 1 0 0 0 0 0 0 1.0
154806 456255 0 0 0 0 0 -0.573222 -0.073524 0.125065 1.383128 0.293071 1.533914 0.172208 0.514198 0.025704 1.722426 0.002542 0.458151 2.15885 0.047807 1.490005 -0.267116 -0.108965 -1.821758 -1.805455 2.313132 -0.110786 -0.220085 -0.2025 -0.193282 2.698927 2.958003 0.919464 -2.153016 -0.405740 -0.454554 0.180889 0.008392 -0.811059 1.026778 -0.109458 -0.263691 -0.419175 -0.363781 -0.408638 0.173990 0.047529 -0.758313 1.061236 -0.074355 -0.217682 -0.395464 -0.399960 -0.449844 0.177971 0.014939 -0.803313 1.026379 -0.106169 -0.258046 -0.413749 -0.296211 -0.564847 -0.313846 -0.563222 -0.269338 0.240145 -0.006725 0.660145 -0.00843 -0.124743 -0.305616 -0.014821 -0.310105 -0.069586 ... 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 1 0 0 0 0 0 0 0.0

154807 rows × 212 columns

Without Standardisation

In [ ]:
# Create a label encoder object
le = LabelEncoder()
le_count = 0
oh_count = 0

# Iterate through the columns. The loop walks the original column index,
# so reassigning app_train_final inside the loop does not disturb it.
for col in app_train_final:
    # Leave the target and the client identifier untouched
    if col in ('TARGET', 'SK_ID_CURR'):
        continue
    # Only categorical (object) columns need encoding
    if app_train_final[col].dtype != 'object':
        continue

    if len(app_train_final[col].unique()) <= 2:
        # 2 or fewer unique categories: label encoding
        # Fit on the training data only
        le.fit(app_train_final[col])
        # Transform both training and testing data with the same mapping
        app_train_final[col] = le.transform(app_train_final[col])
        app_test_final[col] = le.transform(app_test_final[col])
        # Keep track of how many columns were label encoded
        le_count += 1
    else:
        # More than 2 categories: one-hot encoding
        app_train_final = pd.get_dummies(app_train_final, prefix=[col], columns=[col])
        app_test_final = pd.get_dummies(app_test_final, prefix=[col], columns=[col])
        # Keep track of how many columns were one-hot encoded
        oh_count += 1

print('%d columns were label encoded.' % le_count)
print('Training Features shape: ', app_train_final.shape)
print('Testing Features shape: ', app_test_final.shape)
print('%d columns were one hot encoded.' % oh_count)
5 columns were label encoded.
Training Features shape:  (154807, 217)
Testing Features shape:  (25345, 211)
10 columns were one hot encoded.
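
The shapes printed above differ between train and test (217 vs 211 columns). A likely cause is that `pd.get_dummies` is applied to each frame separately, so a category that appears only in the training data produces a column the test frame never gets. The sketch below reproduces this on toy data. The frames and the `HOUSING` column are hypothetical and are not taken from the Home Credit tables.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical toy frames, used only to illustrate the mechanism
toy_train = pd.DataFrame({
    "FLAG_OWN_CAR": ["N", "Y", "N"],
    "HOUSING": ["flat", "house", "parents"],
})
toy_test = pd.DataFrame({
    "FLAG_OWN_CAR": ["Y", "N"],
    "HOUSING": ["flat", "flat"],
})

# Binary column: fit one encoder on train, reuse the same mapping on test
encoder = LabelEncoder()
toy_train["FLAG_OWN_CAR"] = encoder.fit_transform(toy_train["FLAG_OWN_CAR"])
toy_test["FLAG_OWN_CAR"] = encoder.transform(toy_test["FLAG_OWN_CAR"])

# Multi-category column: one-hot encode each frame on its own
toy_train = pd.get_dummies(toy_train, prefix=["HOUSING"], columns=["HOUSING"])
toy_test = pd.get_dummies(toy_test, prefix=["HOUSING"], columns=["HOUSING"])

# Dummy columns that exist in train but were never created in test
missing_in_test = sorted(set(toy_train.columns) - set(toy_test.columns))
print(toy_train.shape, toy_test.shape)
print(missing_in_test)
```

In this toy case the test frame only ever sees the `flat` category, so the `house` and `parents` dummy columns exist in train alone. This is why the frames need a column alignment step before modelling.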
In [ ]:
train_labels = app_train_final['TARGET']

# Align the training and testing data, keep only columns present in both dataframes
app_train_final, app_test_final = app_train_final.align(app_test_final, join = 'inner', axis = 1)

# Add the target back in
app_train_final['TARGET'] = train_labels

print('Training Features shape: ', app_train_final.shape)
print('Testing Features shape: ', app_test_final.shape)
Training Features shape:  (154807, 212)
Testing Features shape:  (25345, 211)
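
The printed shapes are consistent with an inner join on columns: train goes from 217 to 211 shared columns plus `TARGET` (212), and test stays at 211. A minimal sketch of what `DataFrame.align(..., join="inner", axis=1)` does, on hypothetical toy frames whose column names `A`, `B_x` and `B_y` are invented for the illustration:

```python
import pandas as pd

# Hypothetical toy frames: B_x exists only in train, B_y only in test
train = pd.DataFrame({
    "SK_ID_CURR": [1, 2],
    "A": [0, 1],
    "B_x": [1, 0],
    "TARGET": [0, 1],
})
test = pd.DataFrame({"SK_ID_CURR": [3], "A": [1], "B_y": [1]})

# Save the labels first: TARGET is absent from test, so the inner join drops it
labels = train["TARGET"]

# Keep only the columns present in both frames (rows are left untouched)
train_aligned, test_aligned = train.align(test, join="inner", axis=1)
shared_columns = set(train_aligned.columns)

# Add the target back to the training frame only
train_aligned["TARGET"] = labels

print(train_aligned.shape, test_aligned.shape)
```

Saving `TARGET` before the call matters: once the inner join has run, the column is no longer in the aligned training frame.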
In [ ]:
app_train_final
Out[ ]:
SK_ID_CURR CNT_CHILDREN AMT_INCOME_TOTAL AMT_CREDIT AMT_ANNUITY AMT_GOODS_PRICE REGION_POPULATION_RELATIVE DAYS_BIRTH DAYS_EMPLOYED DAYS_REGISTRATION DAYS_ID_PUBLISH FLAG_MOBIL FLAG_EMP_PHONE FLAG_WORK_PHONE FLAG_CONT_MOBILE FLAG_PHONE FLAG_EMAIL CNT_FAM_MEMBERS REGION_RATING_CLIENT REGION_RATING_CLIENT_W_CITY HOUR_APPR_PROCESS_START REG_REGION_NOT_LIVE_REGION REG_REGION_NOT_WORK_REGION LIVE_REGION_NOT_WORK_REGION REG_CITY_NOT_LIVE_CITY REG_CITY_NOT_WORK_CITY LIVE_CITY_NOT_WORK_CITY EXT_SOURCE_2 EXT_SOURCE_3 APARTMENTS_AVG BASEMENTAREA_AVG YEARS_BEGINEXPLUATATION_AVG ELEVATORS_AVG ENTRANCES_AVG FLOORSMAX_AVG LANDAREA_AVG LIVINGAREA_AVG NONLIVINGAREA_AVG APARTMENTS_MODE BASEMENTAREA_MODE YEARS_BEGINEXPLUATATION_MODE ELEVATORS_MODE ENTRANCES_MODE FLOORSMAX_MODE LANDAREA_MODE LIVINGAREA_MODE NONLIVINGAREA_MODE APARTMENTS_MEDI BASEMENTAREA_MEDI YEARS_BEGINEXPLUATATION_MEDI ELEVATORS_MEDI ENTRANCES_MEDI FLOORSMAX_MEDI LANDAREA_MEDI LIVINGAREA_MEDI NONLIVINGAREA_MEDI TOTALAREA_MODE OBS_30_CNT_SOCIAL_CIRCLE DEF_30_CNT_SOCIAL_CIRCLE OBS_60_CNT_SOCIAL_CIRCLE DEF_60_CNT_SOCIAL_CIRCLE DAYS_LAST_PHONE_CHANGE FLAG_DOCUMENT_2 FLAG_DOCUMENT_3 FLAG_DOCUMENT_4 FLAG_DOCUMENT_5 FLAG_DOCUMENT_6 FLAG_DOCUMENT_7 FLAG_DOCUMENT_8 FLAG_DOCUMENT_9 FLAG_DOCUMENT_10 FLAG_DOCUMENT_11 FLAG_DOCUMENT_12 FLAG_DOCUMENT_13 FLAG_DOCUMENT_14 ... 
ORGANIZATION_TYPE_Agriculture ORGANIZATION_TYPE_Bank ORGANIZATION_TYPE_Business_Entity_Type_1 ORGANIZATION_TYPE_Business_Entity_Type_2 ORGANIZATION_TYPE_Business_Entity_Type_3 ORGANIZATION_TYPE_Cleaning ORGANIZATION_TYPE_Construction ORGANIZATION_TYPE_Culture ORGANIZATION_TYPE_Electricity ORGANIZATION_TYPE_Emergency ORGANIZATION_TYPE_Government ORGANIZATION_TYPE_Hotel ORGANIZATION_TYPE_Housing ORGANIZATION_TYPE_Industry:_type_1 ORGANIZATION_TYPE_Industry:_type_10 ORGANIZATION_TYPE_Industry:_type_11 ORGANIZATION_TYPE_Industry:_type_12 ORGANIZATION_TYPE_Industry:_type_13 ORGANIZATION_TYPE_Industry:_type_2 ORGANIZATION_TYPE_Industry:_type_3 ORGANIZATION_TYPE_Industry:_type_4 ORGANIZATION_TYPE_Industry:_type_5 ORGANIZATION_TYPE_Industry:_type_6 ORGANIZATION_TYPE_Industry:_type_7 ORGANIZATION_TYPE_Industry:_type_9 ORGANIZATION_TYPE_Insurance ORGANIZATION_TYPE_Kindergarten ORGANIZATION_TYPE_Legal_Services ORGANIZATION_TYPE_Medicine ORGANIZATION_TYPE_Military ORGANIZATION_TYPE_Mobile ORGANIZATION_TYPE_Other ORGANIZATION_TYPE_Police ORGANIZATION_TYPE_Postal ORGANIZATION_TYPE_Realtor ORGANIZATION_TYPE_Religion ORGANIZATION_TYPE_Restaurant ORGANIZATION_TYPE_School ORGANIZATION_TYPE_Security ORGANIZATION_TYPE_Security_Ministries ORGANIZATION_TYPE_Self-employed ORGANIZATION_TYPE_Services ORGANIZATION_TYPE_Telecom ORGANIZATION_TYPE_Trade:_type_1 ORGANIZATION_TYPE_Trade:_type_2 ORGANIZATION_TYPE_Trade:_type_3 ORGANIZATION_TYPE_Trade:_type_4 ORGANIZATION_TYPE_Trade:_type_5 ORGANIZATION_TYPE_Trade:_type_6 ORGANIZATION_TYPE_Trade:_type_7 ORGANIZATION_TYPE_Transport:_type_1 ORGANIZATION_TYPE_Transport:_type_2 ORGANIZATION_TYPE_Transport:_type_3 ORGANIZATION_TYPE_Transport:_type_4 ORGANIZATION_TYPE_University HOUSETYPE_MODE_block_of_flats HOUSETYPE_MODE_specific_housing HOUSETYPE_MODE_terraced_house WALLSMATERIAL_MODE_Block WALLSMATERIAL_MODE_Mixed WALLSMATERIAL_MODE_Monolithic WALLSMATERIAL_MODE_Others WALLSMATERIAL_MODE_Panel WALLSMATERIAL_MODE_Stone,_brick 
WALLSMATERIAL_MODE_Wooden MOST_CREDIT_TYPE_Another_type_of_loan MOST_CREDIT_TYPE_Car_loan MOST_CREDIT_TYPE_Consumer_credit MOST_CREDIT_TYPE_Credit_card MOST_CREDIT_TYPE_Loan_for_business_development MOST_CREDIT_TYPE_Loan_for_working_capital_replenishment MOST_CREDIT_TYPE_Microloan MOST_CREDIT_TYPE_Mortgage MOST_CREDIT_TYPE_Unknown_type_of_loan TARGET
0 100002 0 202500.0 406597.5 24700.5 351000.0 0.018801 9461 -637.0 -3648.0 -2120 1 1 0 1 1 0 1.0 2 2 10 0 0 0 0 0 0 0.262949 0.139376 0.0247 0.0369 0.9722 0.00 0.0690 0.0833 0.0369 0.0190 0.0000 0.0252 0.0383 0.9722 0.0000 0.0690 0.0833 0.0377 0.0198 0.0000 0.0250 0.0369 0.9722 0.00 0.0690 0.0833 0.0375 0.0193 0.0000 0.0149 2.0 2.0 2.0 2.0 -1134.0 0 1 0 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 1 0 0 0 1 0 0 0 0 0 0 1.0
1 100003 0 270000.0 1293502.5 35698.5 1129500.0 0.003541 16765 -1188.0 -1186.0 -291 1 1 0 1 1 0 2.0 1 1 11 0 0 0 0 0 0 0.622246 0.158014 0.0959 0.0529 0.9851 0.08 0.0345 0.2917 0.0130 0.0549 0.0098 0.0924 0.0538 0.9851 0.0806 0.0345 0.2917 0.0128 0.0554 0.0000 0.0968 0.0529 0.9851 0.08 0.0345 0.2917 0.0132 0.0558 0.0100 0.0714 1.0 0.0 1.0 0.0 -828.0 0 1 0 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 1 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0.0
2 100016 0 67500.0 80865.0 5881.5 67500.0 0.031329 13439 -2717.0 -311.0 -3227 1 1 1 1 1 0 2.0 2 2 10 0 0 0 0 0 0 0.715042 0.176653 0.0825 0.0751 0.9811 0.00 0.2069 0.1667 0.0135 0.0778 0.0000 0.0840 0.0774 0.9811 0.0000 0.2069 0.1667 0.0138 0.0810 0.0000 0.0833 0.0751 0.9811 0.00 0.2069 0.1667 0.0137 0.0792 0.0000 0.0612 0.0 0.0 0.0 0.0 -2370.0 0 1 0 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 1 0 0 0 0 0 0 0.0
3 100017 1 225000.0 918468.0 28966.5 697500.0 0.016612 14086 -3028.0 -643.0 -4911 1 1 0 1 0 0 3.0 2 2 13 0 0 0 0 0 0 0.566907 0.770087 0.1474 0.0973 0.9806 0.16 0.1379 0.3333 0.0931 0.1397 0.0000 0.1502 0.1010 0.9806 0.1611 0.1379 0.3333 0.0952 0.1456 0.0000 0.1489 0.0973 0.9806 0.16 0.1379 0.3333 0.0947 0.1422 0.0000 0.1417 0.0 0.0 0.0 0.0 -4.0 0 1 0 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 1 0 0 0 0 0 0 0.0
4 100018 0 189000.0 773680.5 32778.0 679500.0 0.010006 14583 -203.0 -615.0 -2056 1 1 0 1 0 0 2.0 2 1 9 0 0 0 0 0 0 0.642656 0.663407 0.3495 0.1335 0.9985 0.40 0.1724 0.6667 0.1758 0.3774 0.1001 0.3561 0.1386 0.9985 0.4028 0.1724 0.6667 0.1798 0.3932 0.1060 0.3529 0.1335 0.9985 0.40 0.1724 0.6667 0.1789 0.3842 0.1022 0.3811 0.0 0.0 0.0 0.0 -188.0 0 1 0 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 1 0 0 0 0 0 0 0.0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
154802 456251 0 157500.0 254700.0 27558.0 225000.0 0.032561 9327 -236.0 -8456.0 -1982 1 1 0 1 0 0 1.0 1 1 15 0 0 0 0 0 0 0.681632 0.567741 0.2021 0.0887 0.9876 0.22 0.1034 0.6042 0.0594 0.1965 0.1095 0.1008 0.0172 0.9782 0.0806 0.0345 0.4583 0.0094 0.0853 0.0125 0.2040 0.0887 0.9876 0.22 0.1034 0.6042 0.0605 0.2001 0.1118 0.2898 0.0 0.0 0.0 0.0 -273.0 0 0 0 0 0 0 1 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 1 0 0 0 1 0 0 0 0 0 0 0.0
154803 456252 0 72000.0 269550.0 12001.5 225000.0 0.025164 20775 -4078.5 -4388.0 -4090 1 0 0 1 1 0 1.0 2 2 8 0 0 0 0 0 0 0.115992 0.393300 0.0247 0.0435 0.9727 0.00 0.1034 0.0833 0.0579 0.0257 0.0000 0.0252 0.0451 0.9727 0.0000 0.1034 0.0833 0.0592 0.0267 0.0000 0.0250 0.0435 0.9727 0.00 0.1034 0.0833 0.0589 0.0261 0.0000 0.0214 0.0 0.0 0.0 0.0 0.0 0 1 0 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 1 0 0 0 1 0 0 0 0 0 0 0.0
154804 456253 0 153000.0 677664.0 29979.0 585000.0 0.005002 14966 -7921.0 -6737.0 -5150 1 1 0 1 0 1 1.0 3 3 9 0 0 0 0 1 1 0.535722 0.218859 0.1031 0.0862 0.9816 0.00 0.2069 0.1667 0.0579 0.9279 0.0000 0.1050 0.0894 0.9816 0.0000 0.2069 0.1667 0.0592 0.9667 0.0000 0.1041 0.0862 0.9816 0.00 0.2069 0.1667 0.0589 0.9445 0.0000 0.7970 6.0 0.0 6.0 0.0 -1909.0 0 1 0 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 1 0 0 0 0 0 0 0.0
154805 456254 0 171000.0 370107.0 20205.0 319500.0 0.005313 11961 -4786.0 -2562.0 -931 1 1 0 1 0 0 2.0 2 2 9 0 0 0 1 1 0 0.514163 0.661024 0.0124 0.0694 0.9771 0.04 0.0690 0.0417 0.0579 0.0061 0.0000 0.0126 0.0720 0.9772 0.0403 0.0690 0.0417 0.0592 0.0063 0.0000 0.0125 0.0694 0.9771 0.04 0.0690 0.0417 0.0589 0.0062 0.0000 0.0086 0.0 0.0 0.0 0.0 -322.0 0 1 0 0 0 0 0 0 0 0 0 0 0 ... 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 1 0 0 0 1 0 0 0 0 0 0 1.0
154806 456255 0 157500.0 675000.0 49117.5 675000.0 0.046220 16856 -1262.0 -5128.0 -410 1 1 1 1 1 0 2.0 1 1 20 0 0 0 0 1 1 0.708569 0.113922 0.0742 0.0526 0.9881 0.08 0.0690 0.3750 0.0579 0.0791 0.0000 0.0756 0.0546 0.9881 0.0806 0.0690 0.3750 0.0592 0.0824 0.0000 0.0749 0.0526 0.9881 0.08 0.0690 0.3750 0.0589 0.0805 0.0000 0.0718 0.0 0.0 0.0 0.0 -787.0 0 1 0 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 1 0 0 0 0 0 0 0.0

154807 rows × 212 columns

In [ ]:
app_train_final.to_csv("train_imputed_without_standardisation.csv", index=False)
app_test_final.to_csv("test_imputed_without_standardisation.csv", index=False)

Saving CSVs

In [ ]:
df_app_train.to_csv("df_train_imputed.csv", index=False)
df_app_test.to_csv("df_test_imputed.csv", index=False)
In [ ]:
app_train_reduced.to_csv("real_data_clean_train.csv", index=False)
app_test_reduced.to_csv("real_data_clean_test.csv", index=False)
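
All exports above use `index=False`, so the DataFrame index is not written to disk. For a frame such as `app_train_reduced`, whose displayed index is not contiguous (0, 1, 12, 13, ...), reloading the file yields a fresh 0-based index. The sketch below shows this with a hypothetical three-row frame and an in-memory buffer standing in for the CSV file.

```python
import io

import pandas as pd

# Hypothetical frame with a non-contiguous index, as left behind by row filtering
frame = pd.DataFrame(
    {"SK_ID_CURR": [100002, 100003, 100016], "TARGET": [1.0, 0.0, 0.0]},
    index=[0, 1, 12],
)

# Write without the index, then read back from the same in-memory buffer
buffer = io.StringIO()
frame.to_csv(buffer, index=False)
buffer.seek(0)
reloaded = pd.read_csv(buffer)

print(list(frame.index), "->", list(reloaded.index))
```

Because `SK_ID_CURR` is kept as a regular column, rows can still be matched by client identifier after reloading, even though the original index labels are gone.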
In [ ]:
app_train_reduced
Out[ ]:
SK_ID_CURR TARGET NAME_CONTRACT_TYPE CODE_GENDER FLAG_OWN_CAR FLAG_OWN_REALTY CNT_CHILDREN AMT_INCOME_TOTAL AMT_CREDIT AMT_ANNUITY AMT_GOODS_PRICE NAME_TYPE_SUITE NAME_INCOME_TYPE NAME_EDUCATION_TYPE NAME_FAMILY_STATUS NAME_HOUSING_TYPE REGION_POPULATION_RELATIVE DAYS_BIRTH DAYS_EMPLOYED DAYS_REGISTRATION DAYS_ID_PUBLISH FLAG_MOBIL FLAG_EMP_PHONE FLAG_WORK_PHONE FLAG_CONT_MOBILE FLAG_PHONE FLAG_EMAIL CNT_FAM_MEMBERS REGION_RATING_CLIENT REGION_RATING_CLIENT_W_CITY WEEKDAY_APPR_PROCESS_START HOUR_APPR_PROCESS_START REG_REGION_NOT_LIVE_REGION REG_REGION_NOT_WORK_REGION LIVE_REGION_NOT_WORK_REGION REG_CITY_NOT_LIVE_CITY REG_CITY_NOT_WORK_CITY LIVE_CITY_NOT_WORK_CITY ORGANIZATION_TYPE EXT_SOURCE_2 EXT_SOURCE_3 APARTMENTS_AVG BASEMENTAREA_AVG YEARS_BEGINEXPLUATATION_AVG ELEVATORS_AVG ENTRANCES_AVG FLOORSMAX_AVG LANDAREA_AVG LIVINGAREA_AVG NONLIVINGAREA_AVG APARTMENTS_MODE BASEMENTAREA_MODE YEARS_BEGINEXPLUATATION_MODE ELEVATORS_MODE ENTRANCES_MODE FLOORSMAX_MODE LANDAREA_MODE LIVINGAREA_MODE NONLIVINGAREA_MODE APARTMENTS_MEDI BASEMENTAREA_MEDI YEARS_BEGINEXPLUATATION_MEDI ELEVATORS_MEDI ENTRANCES_MEDI FLOORSMAX_MEDI LANDAREA_MEDI LIVINGAREA_MEDI NONLIVINGAREA_MEDI HOUSETYPE_MODE TOTALAREA_MODE WALLSMATERIAL_MODE EMERGENCYSTATE_MODE OBS_30_CNT_SOCIAL_CIRCLE DEF_30_CNT_SOCIAL_CIRCLE OBS_60_CNT_SOCIAL_CIRCLE DEF_60_CNT_SOCIAL_CIRCLE DAYS_LAST_PHONE_CHANGE FLAG_DOCUMENT_2 FLAG_DOCUMENT_3 FLAG_DOCUMENT_4 FLAG_DOCUMENT_5 FLAG_DOCUMENT_6 FLAG_DOCUMENT_7 FLAG_DOCUMENT_8 FLAG_DOCUMENT_9 FLAG_DOCUMENT_10 FLAG_DOCUMENT_11 FLAG_DOCUMENT_12 FLAG_DOCUMENT_13 FLAG_DOCUMENT_14 FLAG_DOCUMENT_15 FLAG_DOCUMENT_16 FLAG_DOCUMENT_17 FLAG_DOCUMENT_18 FLAG_DOCUMENT_19 FLAG_DOCUMENT_20 FLAG_DOCUMENT_21 AMT_REQ_CREDIT_BUREAU_HOUR AMT_REQ_CREDIT_BUREAU_DAY AMT_REQ_CREDIT_BUREAU_WEEK AMT_REQ_CREDIT_BUREAU_MON AMT_REQ_CREDIT_BUREAU_QRT AMT_REQ_CREDIT_BUREAU_YEAR AGE DAYS_EMPLOYED_ANOM PREVIOUS_APPLICATION_COUNT MOST_CREDIT_TYPE PREVIOUS_LOANS_COUNT CREDIT_PERCENT_INCOME 
ANNUITY_CREDIT_PERCENT_INCOME CREDIT_REFUND_TIME DAYS_EMPLOYED_PERCENT EXT_SOURCE_1
0 100002 1.0 Cash_loans M N Y 0 202500.0 406597.5 24700.5 351000.0 Unaccompanied Working Secondary_/_secondary_special Single_/_not_married House_/_apartment 0.018801 9461 -637.0 -3648.0 -2120 1 1 0 1 1 0 1.0 2 2 WEDNESDAY 10 0 0 0 0 0 0 Business_Entity_Type_3 0.262949 0.139376 0.0247 0.0369 0.9722 0.00 0.0690 0.0833 0.0369 0.0190 0.0000 0.0252 0.0383 0.9722 0.0000 0.0690 0.0833 0.0377 0.0198 0.0000 0.0250 0.0369 0.9722 0.00 0.0690 0.0833 0.0375 0.0193 0.0000 block_of_flats 0.0149 Stone,_brick No 2.0 2.0 2.0 2.0 -1134.0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0.0 0.0 0.0 0.0 0.0 1.0 26 False 8.0 Consumer_credit 1.0 2.007889 0.121978 16.461104 -0.067329 0.083037
1 100003 0.0 Cash_loans F N N 0 270000.0 1293502.5 35698.5 1129500.0 Family State_servant Higher_education Married House_/_apartment 0.003541 16765 -1188.0 -1186.0 -291 1 1 0 1 1 0 2.0 1 1 MONDAY 11 0 0 0 0 0 0 School 0.622246 NaN 0.0959 0.0529 0.9851 0.08 0.0345 0.2917 0.0130 0.0549 0.0098 0.0924 0.0538 0.9851 0.0806 0.0345 0.2917 0.0128 0.0554 0.0000 0.0968 0.0529 0.9851 0.08 0.0345 0.2917 0.0132 0.0558 0.0100 block_of_flats 0.0714 Block No 1.0 0.0 1.0 0.0 -828.0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0.0 0.0 0.0 0.0 0.0 0.0 46 False 4.0 Credit_card 3.0 4.790750 0.132217 36.234085 -0.070862 0.311267
12 100016 0.0 Cash_loans F N Y 0 67500.0 80865.0 5881.5 67500.0 Unaccompanied Working Secondary_/_secondary_special Married House_/_apartment 0.031329 13439 -2717.0 -311.0 -3227 1 1 1 1 1 0 2.0 2 2 FRIDAY 10 0 0 0 0 0 0 Business_Entity_Type_2 0.715042 0.176653 0.0825 NaN 0.9811 0.00 0.2069 0.1667 0.0135 0.0778 0.0000 0.0840 NaN 0.9811 0.0000 0.2069 0.1667 0.0138 0.0810 0.0000 0.0833 NaN 0.9811 0.00 0.2069 0.1667 0.0137 0.0792 0.0000 block_of_flats 0.0612 NaN No 0.0 0.0 0.0 0.0 -2370.0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0.0 0.0 0.0 1.0 0.0 0.0 37 False 7.0 Consumer_credit 4.0 1.198000 0.087133 13.749044 -0.202173 0.464831
13 100017 0.0 Cash_loans M Y N 1 225000.0 918468.0 28966.5 697500.0 Unaccompanied Working Secondary_/_secondary_special Married House_/_apartment 0.016612 14086 -3028.0 -643.0 -4911 1 1 0 1 0 0 3.0 2 2 THURSDAY 13 0 0 0 0 0 0 Self-employed 0.566907 0.770087 0.1474 0.0973 0.9806 0.16 0.1379 0.3333 0.0931 0.1397 0.0000 0.1502 0.1010 0.9806 0.1611 0.1379 0.3333 0.0952 0.1456 0.0000 0.1489 0.0973 0.9806 0.16 0.1379 0.3333 0.0947 0.1422 0.0000 block_of_flats 0.1417 Panel No 0.0 0.0 0.0 0.0 -4.0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0.0 0.0 0.0 0.0 0.0 1.0 39 False 6.0 Consumer_credit 2.0 4.082080 0.128740 31.707938 -0.214965 NaN
14 100018 0.0 Cash_loans F N Y 0 189000.0 773680.5 32778.0 679500.0 Unaccompanied Working Secondary_/_secondary_special Married House_/_apartment 0.010006 14583 -203.0 -615.0 -2056 1 1 0 1 0 0 2.0 2 1 MONDAY 9 0 0 0 0 0 0 Transport:_type_2 0.642656 NaN 0.3495 0.1335 0.9985 0.40 0.1724 0.6667 0.1758 0.3774 0.1001 0.3561 0.1386 0.9985 0.4028 0.1724 0.6667 0.1798 0.3932 0.1060 0.3529 0.1335 0.9985 0.40 0.1724 0.6667 0.1789 0.3842 0.1022 block_of_flats 0.3811 Panel No 0.0 0.0 0.0 0.0 -188.0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 NaN NaN NaN NaN NaN NaN 40 False NaN NaN 4.0 4.093548 0.173429 23.603652 -0.013920 0.721940
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
307506 456251 0.0 Cash_loans M N N 0 157500.0 254700.0 27558.0 225000.0 Unaccompanied Working Secondary_/_secondary_special Separated With_parents 0.032561 9327 -236.0 -8456.0 -1982 1 1 0 1 0 0 1.0 1 1 THURSDAY 15 0 0 0 0 0 0 Services 0.681632 NaN 0.2021 0.0887 0.9876 0.22 0.1034 0.6042 0.0594 0.1965 0.1095 0.1008 0.0172 0.9782 0.0806 0.0345 0.4583 0.0094 0.0853 0.0125 0.2040 0.0887 0.9876 0.22 0.1034 0.6042 0.0605 0.2001 0.1118 block_of_flats 0.2898 Stone,_brick No 0.0 0.0 0.0 0.0 -273.0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 NaN NaN NaN NaN NaN NaN 26 False NaN NaN 1.0 1.617143 0.174971 9.242325 -0.025303 0.145570
307507 456252 0.0 Cash_loans F N Y 0 72000.0 269550.0 12001.5 225000.0 Unaccompanied Pensioner Secondary_/_secondary_special Widow House_/_apartment 0.025164 20775 NaN -4388.0 -4090 1 0 0 1 1 0 1.0 2 2 MONDAY 8 0 0 0 0 0 0 NaN 0.115992 NaN 0.0247 0.0435 0.9727 0.00 0.1034 0.0833 0.0579 0.0257 0.0000 0.0252 0.0451 0.9727 0.0000 0.1034 0.0833 0.0592 0.0267 0.0000 0.0250 0.0435 0.9727 0.00 0.1034 0.0833 0.0589 0.0261 0.0000 block_of_flats 0.0214 Stone,_brick No 0.0 0.0 0.0 0.0 0.0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 NaN NaN NaN NaN NaN NaN 57 True NaN NaN 1.0 3.743750 0.166687 22.459693 NaN NaN
307508 456253 0.0 Cash_loans F N Y 0 153000.0 677664.0 29979.0 585000.0 Unaccompanied Working Higher_education Separated House_/_apartment 0.005002 14966 -7921.0 -6737.0 -5150 1 1 0 1 0 1 1.0 3 3 THURSDAY 9 0 0 0 0 1 1 School 0.535722 0.218859 0.1031 0.0862 0.9816 0.00 0.2069 0.1667 NaN 0.9279 0.0000 0.1050 0.0894 0.9816 0.0000 0.2069 0.1667 NaN 0.9667 0.0000 0.1041 0.0862 0.9816 0.00 0.2069 0.1667 NaN 0.9445 0.0000 block_of_flats 0.7970 Panel No 6.0 0.0 6.0 0.0 -1909.0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1.0 0.0 0.0 1.0 0.0 1.0 41 False 4.0 Consumer_credit 2.0 4.429176 0.195941 22.604623 -0.529266 0.744026
307509 456254 1.0 Cash_loans F N Y 0 171000.0 370107.0 20205.0 319500.0 Unaccompanied Commercial_associate Secondary_/_secondary_special Married House_/_apartment 0.005313 11961 -4786.0 -2562.0 -931 1 1 0 1 0 0 2.0 2 2 WEDNESDAY 9 0 0 0 1 1 0 Business_Entity_Type_1 0.514163 0.661024 0.0124 NaN 0.9771 NaN 0.0690 0.0417 NaN 0.0061 NaN 0.0126 NaN 0.9772 NaN 0.0690 0.0417 NaN 0.0063 NaN 0.0125 NaN 0.9771 NaN 0.0690 0.0417 NaN 0.0062 NaN block_of_flats 0.0086 Stone,_brick No 0.0 0.0 0.0 0.0 -322.0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0.0 0.0 0.0 0.0 0.0 0.0 33 False 1.0 Consumer_credit 2.0 2.164368 0.118158 18.317595 -0.400134 NaN
307510 456255 0.0 Cash_loans F N N 0 157500.0 675000.0 49117.5 675000.0 Unaccompanied Commercial_associate Higher_education Married House_/_apartment 0.046220 16856 -1262.0 -5128.0 -410 1 1 1 1 1 0 2.0 1 1 THURSDAY 20 0 0 0 0 1 1 Business_Entity_Type_3 0.708569 0.113922 0.0742 0.0526 0.9881 0.08 0.0690 0.3750 NaN 0.0791 0.0000 0.0756 0.0546 0.9881 0.0806 0.0690 0.3750 NaN 0.0824 0.0000 0.0749 0.0526 0.9881 0.08 0.0690 0.3750 NaN 0.0805 0.0000 block_of_flats 0.0718 Panel No 0.0 0.0 0.0 0.0 -787.0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0.0 0.0 0.0 2.0 0.0 1.0 46 False 11.0 Consumer_credit 8.0 4.285714 0.311857 13.742556 -0.074869 0.734460

154807 rows × 113 columns

In [ ]:
df_app_train
Out[ ]:
SK_ID_CURR NAME_CONTRACT_TYPE CODE_GENDER FLAG_OWN_CAR FLAG_OWN_REALTY EMERGENCYSTATE_MODE CNT_CHILDREN AMT_INCOME_TOTAL AMT_CREDIT AMT_ANNUITY AMT_GOODS_PRICE REGION_POPULATION_RELATIVE DAYS_BIRTH DAYS_EMPLOYED DAYS_REGISTRATION DAYS_ID_PUBLISH FLAG_MOBIL FLAG_EMP_PHONE FLAG_WORK_PHONE FLAG_CONT_MOBILE FLAG_PHONE FLAG_EMAIL CNT_FAM_MEMBERS REGION_RATING_CLIENT REGION_RATING_CLIENT_W_CITY HOUR_APPR_PROCESS_START REG_REGION_NOT_LIVE_REGION REG_REGION_NOT_WORK_REGION LIVE_REGION_NOT_WORK_REGION REG_CITY_NOT_LIVE_CITY REG_CITY_NOT_WORK_CITY LIVE_CITY_NOT_WORK_CITY EXT_SOURCE_2 EXT_SOURCE_3 APARTMENTS_AVG BASEMENTAREA_AVG YEARS_BEGINEXPLUATATION_AVG ELEVATORS_AVG ENTRANCES_AVG FLOORSMAX_AVG LANDAREA_AVG LIVINGAREA_AVG NONLIVINGAREA_AVG APARTMENTS_MODE BASEMENTAREA_MODE YEARS_BEGINEXPLUATATION_MODE ELEVATORS_MODE ENTRANCES_MODE FLOORSMAX_MODE LANDAREA_MODE LIVINGAREA_MODE NONLIVINGAREA_MODE APARTMENTS_MEDI BASEMENTAREA_MEDI YEARS_BEGINEXPLUATATION_MEDI ELEVATORS_MEDI ENTRANCES_MEDI FLOORSMAX_MEDI LANDAREA_MEDI LIVINGAREA_MEDI NONLIVINGAREA_MEDI TOTALAREA_MODE OBS_30_CNT_SOCIAL_CIRCLE DEF_30_CNT_SOCIAL_CIRCLE OBS_60_CNT_SOCIAL_CIRCLE DEF_60_CNT_SOCIAL_CIRCLE DAYS_LAST_PHONE_CHANGE FLAG_DOCUMENT_2 FLAG_DOCUMENT_3 FLAG_DOCUMENT_4 FLAG_DOCUMENT_5 FLAG_DOCUMENT_6 FLAG_DOCUMENT_7 FLAG_DOCUMENT_8 FLAG_DOCUMENT_9 ... 
ORGANIZATION_TYPE_Agriculture ORGANIZATION_TYPE_Bank ORGANIZATION_TYPE_Business_Entity_Type_1 ORGANIZATION_TYPE_Business_Entity_Type_2 ORGANIZATION_TYPE_Business_Entity_Type_3 ORGANIZATION_TYPE_Cleaning ORGANIZATION_TYPE_Construction ORGANIZATION_TYPE_Culture ORGANIZATION_TYPE_Electricity ORGANIZATION_TYPE_Emergency ORGANIZATION_TYPE_Government ORGANIZATION_TYPE_Hotel ORGANIZATION_TYPE_Housing ORGANIZATION_TYPE_Industry:_type_1 ORGANIZATION_TYPE_Industry:_type_10 ORGANIZATION_TYPE_Industry:_type_11 ORGANIZATION_TYPE_Industry:_type_12 ORGANIZATION_TYPE_Industry:_type_13 ORGANIZATION_TYPE_Industry:_type_2 ORGANIZATION_TYPE_Industry:_type_3 ORGANIZATION_TYPE_Industry:_type_4 ORGANIZATION_TYPE_Industry:_type_5 ORGANIZATION_TYPE_Industry:_type_6 ORGANIZATION_TYPE_Industry:_type_7 ORGANIZATION_TYPE_Industry:_type_9 ORGANIZATION_TYPE_Insurance ORGANIZATION_TYPE_Kindergarten ORGANIZATION_TYPE_Legal_Services ORGANIZATION_TYPE_Medicine ORGANIZATION_TYPE_Military ORGANIZATION_TYPE_Mobile ORGANIZATION_TYPE_Other ORGANIZATION_TYPE_Police ORGANIZATION_TYPE_Postal ORGANIZATION_TYPE_Realtor ORGANIZATION_TYPE_Religion ORGANIZATION_TYPE_Restaurant ORGANIZATION_TYPE_School ORGANIZATION_TYPE_Security ORGANIZATION_TYPE_Security_Ministries ORGANIZATION_TYPE_Self-employed ORGANIZATION_TYPE_Services ORGANIZATION_TYPE_Telecom ORGANIZATION_TYPE_Trade:_type_1 ORGANIZATION_TYPE_Trade:_type_2 ORGANIZATION_TYPE_Trade:_type_3 ORGANIZATION_TYPE_Trade:_type_4 ORGANIZATION_TYPE_Trade:_type_5 ORGANIZATION_TYPE_Trade:_type_6 ORGANIZATION_TYPE_Trade:_type_7 ORGANIZATION_TYPE_Transport:_type_1 ORGANIZATION_TYPE_Transport:_type_2 ORGANIZATION_TYPE_Transport:_type_3 ORGANIZATION_TYPE_Transport:_type_4 ORGANIZATION_TYPE_University HOUSETYPE_MODE_block_of_flats HOUSETYPE_MODE_specific_housing HOUSETYPE_MODE_terraced_house WALLSMATERIAL_MODE_Block WALLSMATERIAL_MODE_Mixed WALLSMATERIAL_MODE_Monolithic WALLSMATERIAL_MODE_Others WALLSMATERIAL_MODE_Panel WALLSMATERIAL_MODE_Stone,_brick 
WALLSMATERIAL_MODE_Wooden MOST_CREDIT_TYPE_Another_type_of_loan MOST_CREDIT_TYPE_Car_loan MOST_CREDIT_TYPE_Consumer_credit MOST_CREDIT_TYPE_Credit_card MOST_CREDIT_TYPE_Loan_for_business_development MOST_CREDIT_TYPE_Loan_for_working_capital_replenishment MOST_CREDIT_TYPE_Microloan MOST_CREDIT_TYPE_Mortgage MOST_CREDIT_TYPE_Unknown_type_of_loan TARGET
0 100002 0 1 0 1 0 -0.573222 0.069058 -0.513860 -0.222911 -0.544493 -0.226831 -1.521797 0.784979 0.435145 0.592401 0.002542 0.458151 -0.46321 0.047807 1.490005 -0.267116 -1.234433 0.032789 0.100683 -0.691452 -0.110786 -0.220085 -0.2025 -0.193282 -0.370518 -0.338066 -1.486774 -2.015710 -0.865735 -0.653469 -0.119336 -0.597191 -0.811059 -0.994768 -0.380311 -0.809814 -0.419175 -0.833449 -0.610656 -0.095632 -0.573400 -0.758313 -0.973308 -0.349750 -0.779994 -0.395464 -0.860099 -0.649427 -0.117949 -0.591143 -0.803313 -0.989141 -0.378838 -0.805722 -0.413749 -0.823533 0.251202 4.215860 0.259889 5.334693 -0.171217 -0.006725 0.660145 -0.00843 -0.124743 -0.305616 -0.014821 -0.310105 -0.069586 ... 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 1 0 0 0 1 0 0 0 0 0 0 1.0
1 100003 0 0 0 0 0 -0.573222 0.282931 1.597394 0.500488 1.467988 -1.206771 0.151362 0.546259 1.116256 1.801065 0.002542 0.458151 -0.46321 0.047807 1.490005 -0.267116 -0.108965 -1.821758 -1.805455 -0.390994 -0.110786 -0.220085 -0.2025 -0.193282 -0.370518 -0.338066 0.453341 -1.915166 -0.204086 -0.450753 0.124243 0.008392 -1.157125 0.449491 -0.688567 -0.483594 -0.274421 -0.207225 -0.418553 0.123118 0.047529 -1.101191 0.480237 -0.668697 -0.460213 -0.395464 -0.198015 -0.446030 0.122137 0.014939 -1.148270 0.450812 -0.688457 -0.479085 -0.267304 -0.299918 -0.156823 -0.313846 -0.151666 -0.269338 0.191540 -0.006725 0.660145 -0.00843 -0.124743 -0.305616 -0.014821 -0.310105 -0.069586 ... 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 1 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0.0
2 100016 0 0 0 1 0 -0.573222 -0.358688 -1.289258 -1.460739 -1.277362 0.577670 -0.610539 -0.116181 1.358324 -0.139141 0.002542 0.458151 2.15885 0.047807 1.490005 -0.267116 -0.108965 0.032789 0.100683 -0.691452 -0.110786 -0.220085 -0.2025 -0.193282 -0.370518 -0.338066 0.954416 -1.814623 -0.328610 -0.169484 0.048715 -0.597191 0.572202 -0.416788 -0.682118 -0.275504 -0.419175 -0.285503 -0.126060 0.055288 -0.573400 0.612204 -0.391611 -0.655888 -0.230258 -0.395464 -0.322501 -0.163817 0.047692 -0.591143 0.575514 -0.412883 -0.682086 -0.269679 -0.413749 -0.394447 -0.564847 -0.313846 -0.563222 -0.269338 -1.636472 -0.006725 0.660145 -0.00843 -0.124743 -0.305616 -0.014821 -0.310105 -0.069586 ... 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 1 0 0 0 0 0 0 0.0
3 100017 0 1 1 0 0 0.890172 0.140349 0.704634 0.057687 0.351236 -0.367401 -0.462328 -0.250921 1.266477 -1.251984 0.002542 0.458151 -0.46321 0.047807 -0.671139 -0.267116 1.016503 0.032789 0.100683 0.209923 -0.110786 -0.220085 -0.2025 -0.193282 -0.370518 -0.338066 0.154523 1.386615 0.274495 0.111785 0.039274 0.613975 -0.119930 0.737788 0.344542 0.286976 -0.419175 0.331402 0.166433 0.046810 0.667688 -0.073551 0.770388 0.386773 0.350019 -0.395464 0.282411 0.118396 0.038386 0.621021 -0.114400 0.738250 0.349977 0.294105 -0.413749 0.351589 -0.564847 -0.313846 -0.563222 -0.269338 1.168377 -0.006725 0.660145 -0.00843 -0.124743 -0.305616 -0.014821 -0.310105 -0.069586 ... 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 1 0 0 0 0 0 0 0.0
4 100018 0 0 0 1 0 -0.573222 0.026283 0.359971 0.308390 0.304704 -0.791613 -0.348478 0.973010 1.274223 0.634695 0.002542 0.458151 -0.46321 0.047807 -0.671139 -0.267116 -0.108965 0.032789 -1.805455 -0.991910 -0.110786 -0.220085 -0.2025 -0.193282 -0.370518 -0.338066 0.563552 0.811138 2.152576 0.570430 0.377263 2.430723 0.226136 3.048325 1.411184 2.446935 1.059385 2.250145 0.632438 0.350346 2.529705 0.269326 3.095780 1.470422 2.574112 1.154918 2.163543 0.578582 0.371529 2.439266 0.230557 3.041898 1.422813 2.459752 1.082916 2.570233 -0.564847 -0.313846 -0.563222 -0.269338 0.950249 -0.006725 0.660145 -0.00843 -0.124743 -0.305616 -0.014821 -0.310105 -0.069586 ... 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 1 0 0 0 0 0 0 0.0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
154802 456251 0 1 0 0 0 -0.573222 -0.073524 -0.875448 -0.034957 -0.870213 0.656785 -1.552493 0.958713 -0.894985 0.683596 0.002542 0.458151 -0.46321 0.047807 -0.671139 -0.267116 -1.234433 -1.821758 -1.805455 0.810840 -0.110786 -0.220085 -0.2025 -0.193282 -0.370518 -0.338066 0.774014 0.295073 0.782813 0.002825 0.171448 1.068162 -0.465996 2.615186 -0.090112 0.803112 1.198231 -0.128947 -0.872165 0.006112 0.047529 -1.101191 1.642235 -0.712248 -0.191633 -0.212636 0.790501 0.009070 0.168666 1.075582 -0.459356 2.610051 -0.085783 0.812250 1.223503 1.724108 -0.564847 -0.313846 -0.563222 -0.269338 0.849483 -0.006725 -1.514819 -0.00843 -0.124743 -0.305616 -0.014821 3.224716 -0.069586 ... 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 1 0 0 0 1 0 0 0 0 0 0 0.0
154803 456252 0 0 0 1 0 -0.573222 -0.344430 -0.840098 -1.058193 -0.870213 0.181777 1.069950 -0.706050 0.230425 -0.709440 0.002542 -2.182687 -0.46321 0.047807 1.490005 -0.267116 -1.234433 0.032789 0.100683 -1.292369 -0.110786 -0.220085 -0.2025 -0.193282 -0.370518 -0.338066 -2.280302 -0.645935 -0.865735 -0.569849 -0.109895 -0.597191 -0.465996 -0.994768 -0.109458 -0.748932 -0.419175 -0.833449 -0.526379 -0.087154 -0.573400 -0.416429 -0.973308 -0.074355 -0.718014 -0.395464 -0.860099 -0.565526 -0.108644 -0.591143 -0.459356 -0.989141 -0.106169 -0.744869 -0.413749 -0.763294 -0.564847 -0.313846 -0.563222 -0.269338 1.173119 -0.006725 0.660145 -0.00843 -0.124743 -0.305616 -0.014821 -0.310105 -0.069586 ... 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 1 0 0 0 1 0 0 0 0 0 0 0.0
154804 456253 0 0 0 1 0 -0.573222 -0.087782 0.131407 0.124285 0.060415 -1.112951 -0.260742 -2.370813 -0.419425 -1.409923 0.002542 0.458151 -0.46321 0.047807 -0.671139 3.743692 -1.234433 1.887336 2.006822 -0.991910 -0.110786 -0.220085 -0.2025 -0.193282 2.698927 2.958003 -0.013867 -1.586943 -0.137177 -0.028850 0.058156 -0.597191 0.572202 -0.416788 -0.109458 7.449279 -0.419175 -0.089808 0.022665 0.063767 -0.573400 0.612204 -0.391611 -0.074355 7.725638 -0.395464 -0.130700 -0.022710 0.056997 -0.591143 0.575514 -0.412883 -0.106169 7.473853 -0.413749 6.424595 1.883299 -0.313846 1.906110 -0.269338 -1.089965 -0.006725 0.660145 -0.00843 -0.124743 -0.305616 -0.014821 -0.310105 -0.069586 ... 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 1 0 0 0 0 0 0 0.0
154805 456254 0 0 0 1 0 -0.573222 -0.030750 -0.600725 -0.518604 -0.625923 -1.092980 -0.949111 -1.012575 0.735586 1.378132 0.002542 0.458151 -0.46321 0.047807 -0.671139 -0.267116 -0.108965 0.032789 0.100683 -0.991910 -0.110786 -0.220085 -0.2025 5.173787 2.698927 -0.338066 -0.130280 0.798280 -0.980037 -0.241702 -0.026814 -0.294400 -0.811059 -1.283066 -0.109458 -0.927036 -0.419175 -0.950866 -0.192987 -0.010845 -0.262935 -0.758313 -1.263459 -0.074355 -0.901259 -0.395464 -0.975365 -0.236277 -0.026754 -0.288102 -0.803313 -1.276579 -0.106169 -0.922953 -0.413749 -0.881918 -0.564847 -0.313846 -0.563222 -0.269338 0.791394 -0.006725 0.660145 -0.00843 -0.124743 -0.305616 -0.014821 -0.310105 -0.069586 ... 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 1 0 0 0 1 0 0 0 0 0 0 1.0
154806 456255 0 0 0 0 0 -0.573222 -0.073524 0.125065 1.383128 0.293071 1.533914 0.172208 0.514198 0.025704 1.722426 0.002542 0.458151 2.15885 0.047807 1.490005 -0.267116 -0.108965 -1.821758 -1.805455 2.313132 -0.110786 -0.220085 -0.2025 -0.193282 2.698927 2.958003 0.919464 -2.153016 -0.405740 -0.454554 0.180889 0.008392 -0.811059 1.026778 -0.109458 -0.263691 -0.419175 -0.363781 -0.408638 0.173990 0.047529 -0.758313 1.061236 -0.074355 -0.217682 -0.395464 -0.399960 -0.449844 0.177971 0.014939 -0.803313 1.026379 -0.106169 -0.258046 -0.413749 -0.296211 -0.564847 -0.313846 -0.563222 -0.269338 0.240145 -0.006725 0.660145 -0.00843 -0.124743 -0.305616 -0.014821 -0.310105 -0.069586 ... 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 1 0 0 0 0 0 0 0.0

154807 rows × 212 columns

In [ ]:
app_train_final
Out[ ]:
SK_ID_CURR CNT_CHILDREN AMT_INCOME_TOTAL AMT_CREDIT AMT_ANNUITY AMT_GOODS_PRICE REGION_POPULATION_RELATIVE DAYS_BIRTH DAYS_EMPLOYED DAYS_REGISTRATION DAYS_ID_PUBLISH FLAG_MOBIL FLAG_EMP_PHONE FLAG_WORK_PHONE FLAG_CONT_MOBILE FLAG_PHONE FLAG_EMAIL CNT_FAM_MEMBERS REGION_RATING_CLIENT REGION_RATING_CLIENT_W_CITY HOUR_APPR_PROCESS_START REG_REGION_NOT_LIVE_REGION REG_REGION_NOT_WORK_REGION LIVE_REGION_NOT_WORK_REGION REG_CITY_NOT_LIVE_CITY REG_CITY_NOT_WORK_CITY LIVE_CITY_NOT_WORK_CITY EXT_SOURCE_2 EXT_SOURCE_3 APARTMENTS_AVG BASEMENTAREA_AVG YEARS_BEGINEXPLUATATION_AVG ELEVATORS_AVG ENTRANCES_AVG FLOORSMAX_AVG LANDAREA_AVG LIVINGAREA_AVG NONLIVINGAREA_AVG APARTMENTS_MODE BASEMENTAREA_MODE YEARS_BEGINEXPLUATATION_MODE ELEVATORS_MODE ENTRANCES_MODE FLOORSMAX_MODE LANDAREA_MODE LIVINGAREA_MODE NONLIVINGAREA_MODE APARTMENTS_MEDI BASEMENTAREA_MEDI YEARS_BEGINEXPLUATATION_MEDI ELEVATORS_MEDI ENTRANCES_MEDI FLOORSMAX_MEDI LANDAREA_MEDI LIVINGAREA_MEDI NONLIVINGAREA_MEDI TOTALAREA_MODE OBS_30_CNT_SOCIAL_CIRCLE DEF_30_CNT_SOCIAL_CIRCLE OBS_60_CNT_SOCIAL_CIRCLE DEF_60_CNT_SOCIAL_CIRCLE DAYS_LAST_PHONE_CHANGE FLAG_DOCUMENT_2 FLAG_DOCUMENT_3 FLAG_DOCUMENT_4 FLAG_DOCUMENT_5 FLAG_DOCUMENT_6 FLAG_DOCUMENT_7 FLAG_DOCUMENT_8 FLAG_DOCUMENT_9 FLAG_DOCUMENT_10 FLAG_DOCUMENT_11 FLAG_DOCUMENT_12 FLAG_DOCUMENT_13 FLAG_DOCUMENT_14 ... 
ORGANIZATION_TYPE_Agriculture ORGANIZATION_TYPE_Bank ORGANIZATION_TYPE_Business_Entity_Type_1 ORGANIZATION_TYPE_Business_Entity_Type_2 ORGANIZATION_TYPE_Business_Entity_Type_3 ORGANIZATION_TYPE_Cleaning ORGANIZATION_TYPE_Construction ORGANIZATION_TYPE_Culture ORGANIZATION_TYPE_Electricity ORGANIZATION_TYPE_Emergency ORGANIZATION_TYPE_Government ORGANIZATION_TYPE_Hotel ORGANIZATION_TYPE_Housing ORGANIZATION_TYPE_Industry:_type_1 ORGANIZATION_TYPE_Industry:_type_10 ORGANIZATION_TYPE_Industry:_type_11 ORGANIZATION_TYPE_Industry:_type_12 ORGANIZATION_TYPE_Industry:_type_13 ORGANIZATION_TYPE_Industry:_type_2 ORGANIZATION_TYPE_Industry:_type_3 ORGANIZATION_TYPE_Industry:_type_4 ORGANIZATION_TYPE_Industry:_type_5 ORGANIZATION_TYPE_Industry:_type_6 ORGANIZATION_TYPE_Industry:_type_7 ORGANIZATION_TYPE_Industry:_type_9 ORGANIZATION_TYPE_Insurance ORGANIZATION_TYPE_Kindergarten ORGANIZATION_TYPE_Legal_Services ORGANIZATION_TYPE_Medicine ORGANIZATION_TYPE_Military ORGANIZATION_TYPE_Mobile ORGANIZATION_TYPE_Other ORGANIZATION_TYPE_Police ORGANIZATION_TYPE_Postal ORGANIZATION_TYPE_Realtor ORGANIZATION_TYPE_Religion ORGANIZATION_TYPE_Restaurant ORGANIZATION_TYPE_School ORGANIZATION_TYPE_Security ORGANIZATION_TYPE_Security_Ministries ORGANIZATION_TYPE_Self-employed ORGANIZATION_TYPE_Services ORGANIZATION_TYPE_Telecom ORGANIZATION_TYPE_Trade:_type_1 ORGANIZATION_TYPE_Trade:_type_2 ORGANIZATION_TYPE_Trade:_type_3 ORGANIZATION_TYPE_Trade:_type_4 ORGANIZATION_TYPE_Trade:_type_5 ORGANIZATION_TYPE_Trade:_type_6 ORGANIZATION_TYPE_Trade:_type_7 ORGANIZATION_TYPE_Transport:_type_1 ORGANIZATION_TYPE_Transport:_type_2 ORGANIZATION_TYPE_Transport:_type_3 ORGANIZATION_TYPE_Transport:_type_4 ORGANIZATION_TYPE_University HOUSETYPE_MODE_block_of_flats HOUSETYPE_MODE_specific_housing HOUSETYPE_MODE_terraced_house WALLSMATERIAL_MODE_Block WALLSMATERIAL_MODE_Mixed WALLSMATERIAL_MODE_Monolithic WALLSMATERIAL_MODE_Others WALLSMATERIAL_MODE_Panel WALLSMATERIAL_MODE_Stone,_brick 
WALLSMATERIAL_MODE_Wooden MOST_CREDIT_TYPE_Another_type_of_loan MOST_CREDIT_TYPE_Car_loan MOST_CREDIT_TYPE_Consumer_credit MOST_CREDIT_TYPE_Credit_card MOST_CREDIT_TYPE_Loan_for_business_development MOST_CREDIT_TYPE_Loan_for_working_capital_replenishment MOST_CREDIT_TYPE_Microloan MOST_CREDIT_TYPE_Mortgage MOST_CREDIT_TYPE_Unknown_type_of_loan TARGET
0 100002 0 202500.0 406597.5 24700.5 351000.0 0.018801 9461 -637.0 -3648.0 -2120 1 1 0 1 1 0 1.0 2 2 10 0 0 0 0 0 0 0.262949 0.139376 0.0247 0.0369 0.9722 0.00 0.0690 0.0833 0.0369 0.0190 0.0000 0.0252 0.0383 0.9722 0.0000 0.0690 0.0833 0.0377 0.0198 0.0000 0.0250 0.0369 0.9722 0.00 0.0690 0.0833 0.0375 0.0193 0.0000 0.0149 2.0 2.0 2.0 2.0 -1134.0 0 1 0 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 1 0 0 0 1 0 0 0 0 0 0 1.0
1 100003 0 270000.0 1293502.5 35698.5 1129500.0 0.003541 16765 -1188.0 -1186.0 -291 1 1 0 1 1 0 2.0 1 1 11 0 0 0 0 0 0 0.622246 0.158014 0.0959 0.0529 0.9851 0.08 0.0345 0.2917 0.0130 0.0549 0.0098 0.0924 0.0538 0.9851 0.0806 0.0345 0.2917 0.0128 0.0554 0.0000 0.0968 0.0529 0.9851 0.08 0.0345 0.2917 0.0132 0.0558 0.0100 0.0714 1.0 0.0 1.0 0.0 -828.0 0 1 0 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 1 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0.0
2 100016 0 67500.0 80865.0 5881.5 67500.0 0.031329 13439 -2717.0 -311.0 -3227 1 1 1 1 1 0 2.0 2 2 10 0 0 0 0 0 0 0.715042 0.176653 0.0825 0.0751 0.9811 0.00 0.2069 0.1667 0.0135 0.0778 0.0000 0.0840 0.0774 0.9811 0.0000 0.2069 0.1667 0.0138 0.0810 0.0000 0.0833 0.0751 0.9811 0.00 0.2069 0.1667 0.0137 0.0792 0.0000 0.0612 0.0 0.0 0.0 0.0 -2370.0 0 1 0 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 1 0 0 0 0 0 0 0.0
3 100017 1 225000.0 918468.0 28966.5 697500.0 0.016612 14086 -3028.0 -643.0 -4911 1 1 0 1 0 0 3.0 2 2 13 0 0 0 0 0 0 0.566907 0.770087 0.1474 0.0973 0.9806 0.16 0.1379 0.3333 0.0931 0.1397 0.0000 0.1502 0.1010 0.9806 0.1611 0.1379 0.3333 0.0952 0.1456 0.0000 0.1489 0.0973 0.9806 0.16 0.1379 0.3333 0.0947 0.1422 0.0000 0.1417 0.0 0.0 0.0 0.0 -4.0 0 1 0 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 1 0 0 0 0 0 0 0.0
4 100018 0 189000.0 773680.5 32778.0 679500.0 0.010006 14583 -203.0 -615.0 -2056 1 1 0 1 0 0 2.0 2 1 9 0 0 0 0 0 0 0.642656 0.663407 0.3495 0.1335 0.9985 0.40 0.1724 0.6667 0.1758 0.3774 0.1001 0.3561 0.1386 0.9985 0.4028 0.1724 0.6667 0.1798 0.3932 0.1060 0.3529 0.1335 0.9985 0.40 0.1724 0.6667 0.1789 0.3842 0.1022 0.3811 0.0 0.0 0.0 0.0 -188.0 0 1 0 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 1 0 0 0 0 0 0 0.0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
154802 456251 0 157500.0 254700.0 27558.0 225000.0 0.032561 9327 -236.0 -8456.0 -1982 1 1 0 1 0 0 1.0 1 1 15 0 0 0 0 0 0 0.681632 0.567741 0.2021 0.0887 0.9876 0.22 0.1034 0.6042 0.0594 0.1965 0.1095 0.1008 0.0172 0.9782 0.0806 0.0345 0.4583 0.0094 0.0853 0.0125 0.2040 0.0887 0.9876 0.22 0.1034 0.6042 0.0605 0.2001 0.1118 0.2898 0.0 0.0 0.0 0.0 -273.0 0 0 0 0 0 0 1 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 1 0 0 0 1 0 0 0 0 0 0 0.0
154803 456252 0 72000.0 269550.0 12001.5 225000.0 0.025164 20775 -4078.5 -4388.0 -4090 1 0 0 1 1 0 1.0 2 2 8 0 0 0 0 0 0 0.115992 0.393300 0.0247 0.0435 0.9727 0.00 0.1034 0.0833 0.0579 0.0257 0.0000 0.0252 0.0451 0.9727 0.0000 0.1034 0.0833 0.0592 0.0267 0.0000 0.0250 0.0435 0.9727 0.00 0.1034 0.0833 0.0589 0.0261 0.0000 0.0214 0.0 0.0 0.0 0.0 0.0 0 1 0 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 1 0 0 0 1 0 0 0 0 0 0 0.0
154804 456253 0 153000.0 677664.0 29979.0 585000.0 0.005002 14966 -7921.0 -6737.0 -5150 1 1 0 1 0 1 1.0 3 3 9 0 0 0 0 1 1 0.535722 0.218859 0.1031 0.0862 0.9816 0.00 0.2069 0.1667 0.0579 0.9279 0.0000 0.1050 0.0894 0.9816 0.0000 0.2069 0.1667 0.0592 0.9667 0.0000 0.1041 0.0862 0.9816 0.00 0.2069 0.1667 0.0589 0.9445 0.0000 0.7970 6.0 0.0 6.0 0.0 -1909.0 0 1 0 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 1 0 0 0 0 0 0 0.0
154805 456254 0 171000.0 370107.0 20205.0 319500.0 0.005313 11961 -4786.0 -2562.0 -931 1 1 0 1 0 0 2.0 2 2 9 0 0 0 1 1 0 0.514163 0.661024 0.0124 0.0694 0.9771 0.04 0.0690 0.0417 0.0579 0.0061 0.0000 0.0126 0.0720 0.9772 0.0403 0.0690 0.0417 0.0592 0.0063 0.0000 0.0125 0.0694 0.9771 0.04 0.0690 0.0417 0.0589 0.0062 0.0000 0.0086 0.0 0.0 0.0 0.0 -322.0 0 1 0 0 0 0 0 0 0 0 0 0 0 ... 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 1 0 0 0 1 0 0 0 0 0 0 1.0
154806 456255 0 157500.0 675000.0 49117.5 675000.0 0.046220 16856 -1262.0 -5128.0 -410 1 1 1 1 1 0 2.0 1 1 20 0 0 0 0 1 1 0.708569 0.113922 0.0742 0.0526 0.9881 0.08 0.0690 0.3750 0.0579 0.0791 0.0000 0.0756 0.0546 0.9881 0.0806 0.0690 0.3750 0.0592 0.0824 0.0000 0.0749 0.0526 0.9881 0.08 0.0690 0.3750 0.0589 0.0805 0.0000 0.0718 0.0 0.0 0.0 0.0 -787.0 0 1 0 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 1 0 0 0 0 0 0 0.0

154807 rows × 212 columns
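
Before handing this table to the modelling notebook, a few quick integrity checks can confirm that preprocessing left it in the expected state. The sketch below is not part of the original notebook: the helper name `summarize_frame` is hypothetical, and the toy frame only reuses the first four rows shown above (`SK_ID_CURR`, `AMT_CREDIT`, `TARGET`) so that the example runs on its own.

```python
import pandas as pd


def summarize_frame(df, id_col="SK_ID_CURR", target_col="TARGET"):
    """Return basic integrity checks for a modelling table (hypothetical helper)."""
    return {
        # (rows, columns), to compare with the size printed under the display
        "shape": df.shape,
        # total number of missing cells left after imputation
        "n_missing": int(df.isna().sum().sum()),
        # repeated client identifiers would point to a faulty merge
        "n_duplicate_ids": int(df[id_col].duplicated().sum()),
        # columns that are still non-numeric after encoding
        "non_numeric_columns": df.select_dtypes(exclude="number").columns.tolist(),
        # share of rows with TARGET == 1 (the target is coded 0.0 / 1.0)
        "target_rate": float(df[target_col].mean()),
    }


# Toy stand-in for app_train_final, built from the first four rows displayed above.
toy = pd.DataFrame(
    {
        "SK_ID_CURR": [100002, 100003, 100016, 100017],
        "AMT_CREDIT": [406597.5, 1293502.5, 80865.0, 918468.0],
        "TARGET": [1.0, 0.0, 0.0, 0.0],
    }
)

summary = summarize_frame(toy)
print(summary)
```

In the notebook itself, the call would be `summarize_frame(app_train_final)`, and the reported shape should match the 154807 rows × 212 columns shown above.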
